Condor tutorial for Condor Boot Camp
Master Worker
At this point, you should have finished the Condor exercises and the Search for Knowledge exercises. If everything went smoothly for you, if you have extra time, and if you feel comfortable with C++ on Linux, you can continue on to learn about MW. Because C++ is not a prerequisite for this class, we do not expect you to go through this, but we want to provide it as an option for those of you who are interested.
Getting Ready
Master Worker (MW for short) is an addition to Condor: it is not provided with Condor but is an extra download. It has no hidden knowledge of Condor, but is built on top of Condor using public interfaces. The first thing to do is to download MW into your home directory:
% cd ~
% wget http://www.cs.wisc.edu/condor/mw/mw-0.2.tar.gz
We apologize in advance that the documentation for MW is rather light. But you can read what there is online.
Compiling MW
In your home directory, first extract the MW source code and rename the extracted directory to mw-src, then make a directory to install MW into:
% tar xzf mw-0.2.tar.gz
% mv mw mw-src
% mkdir ~/mw
Now configure and build MW. Make sure that CONDOR_CONFIG is set
properly and that the Condor binaries can be found in
/opt/condor. We're going to build MW without PVM and
use the socket implementation instead. From a high-level
perspective, it doesn't make a difference which you use. The socket
implementation is slightly less capable, but it is much easier to use and
to debug if there are problems. The entire configure/make process should
take just a couple of minutes. Make sure you edit
the prefix that you give to configure so that it points to the install
directory you just created. Note that because your
home directory is on NFS, it may build slowly.
% which condor_version
/opt/condor/bin/condor_version
% echo $CONDOR_CONFIG
/opt/condor/etc/condor_config
% cd mw-src
% ./configure --with-condor=/opt/condor \
--prefix=~/mw \
--without-pvm
checking for g++... g++
checking for C++ compiler default output... a.out
checking whether the C++ compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
[output trimmed...]
% make
[ "__src examples" = "__" ] || for subdir in `echo "src examples"`;
do (cd $subdir && make all) ; done
make[1]: Entering directory `/gine/roy/mw-src/src'
/usr/bin/g++ -DPACKAGE_NAME=\"\"
-DPACKAGE_TARNAME=\"\" -DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\"
-DPACKAGE_BUGREPORT=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1
-DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1
-DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1
-DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_FCNTL_H=1 -DHAVE_LIMITS_H=1
-DHAVE_SYS_TIME_H=1 -DHAVE_UNISTD_H=1 -DTIME_WITH_SYS_TIME=1
-DHAVE_VPRINTF=1 -DHAVE_GETCWD=1 -DHAVE_GETHOSTNAME=1
-DHAVE_GETTIMEOFDAY=1 -DHAVE_MKDIR=1 -DHAVE_STRSTR=1
-DHAVE_DYNAMIC_CAST=
-DCONDOR_DIR=\"/opt/condor\" -I. -I. -IRMComm
-IMW-File -IMW-CondorPVM -IMW-Socket -IMWControlTasks -g -O2
-Wall -c MW.C
[output trimmed...]
% make install
[ "__src examples" = "__" ] || for subdir in `echo "src examples"`;
do (cd $subdir && make install) ; done
make[1]: Entering directory `/home/roy//mw-src/src'
/bin/sh ../mkinstalldirs /home/roy/mw/lib
mkdir /home/roy/school/mw/lib
/usr/bin/install -c -m 644 libMW.a /home/roy/mw/lib/libMW.a
[output trimmed...]
Assuming you don't see any errors, you're set to go!
The examples
MW provides several examples. They are all in the mw-src/examples directory.
% cd examples
% ls
Makefile  Makefile.in  fib/  knapsack/  matmul/  n-queens/  newmatmul/  newskel/  skel/
- fib: Calculate Fibonacci numbers.
- knapsack: An incomplete example. Solve the knapsack problem with branch and bound.
- matmul: Multiply two matrices.
- n-queens: Find a chessboard with N queens on it so that no two queens attack each other.
- newmatmul: Ignore this bad example of matrix multiplication.
- newskel: An empty shell for an MW application
- skel: An empty shell for an MW application
Trying an example in independent mode
MW can run applications within Condor, but it can also run them
without Condor, just on your computer. This independent mode creates only a single
worker, which executes the tasks serially. This can be easier to try out
and easier to debug. Let's run matmul in independent mode. The
matrices to be multiplied are in the file named in_master.
% cd matmaul % cat in_master 1 1 workermatmul_condorpvm.LINUX 0 10 10 10 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 % ./mastermatmul_indp < in_master 10:42:52 MWDriver is pid 32316. 10:42:52 Starting from the beginning. 10:42:52 argc=1, argv[0]=./mastermatmul_indp 10:42:52 workermatmul_condorpvm.LINUX 10:42:52 tempnum_executables = 0 10:42:52 Good to go. 10:42:52 num_TODO = 10, num_run = 0, num_done = 0 10:42:52 CONTINUE -- todo list has at least task number: 10 [output trimmed...] 10:42:55 The resulting Matrix is as follows 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 10:42:55 0 45 90 135 180 225 270 315 360 405 [output trimmed...] 10:42:55 Killing workers: 10:42:55 MWList::Can't remove any element from empty list. 10:42:55 MWList::Can't remove any element from empty list. 10:42:55 MWList::Can't remove any element from empty list. 10:42:55 MWList::Can't remove any element from empty list.Ignore those error messages at the end (Can't remove any element...).
Congratulations! You've successfully run your first MW job, albeit a simple one.
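As a sanity check on the matrix printed by the master, the product can be reproduced outside MW. The following is a minimal standalone C++ sketch, not MW code: the names Matrix, make_input, and multiply are made up for illustration, and it assumes that both input matrices are 10x10 with every row equal to 0 1 ... 9, as the in_master listing suggests. Under that assumption every column j of the second matrix is constant, so entry (i, j) of the product is (0 + 1 + ... + 9) * j = 45 * j, which matches the rows 0 45 90 ... 405.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper types and names; none of this is part of MW.
using Matrix = std::vector<std::vector<long>>;

// Build an n x n matrix whose every row is 0 1 2 ... n-1,
// mirroring the rows shown in in_master.
Matrix make_input(std::size_t n) {
    Matrix m(n, std::vector<long>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            m[i][j] = static_cast<long>(j);
    return m;
}

// Plain triple-loop product C = A * B, run serially in one function.
// MW would distribute this work as tasks; this sketch does not.
Matrix multiply(const Matrix& a, const Matrix& b) {
    // The inner dimensions must agree: columns of A == rows of B.
    assert(!a.empty() && !b.empty() && a[0].size() == b.size());
    Matrix c(a.size(), std::vector<long>(b[0].size(), 0));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t k = 0; k < b.size(); ++k)
            for (std::size_t j = 0; j < b[0].size(); ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}
```

Calling multiply(make_input(10), make_input(10)) and printing each row should give ten identical rows, which is a quick way to confirm that a modified in_master still produces the result you expect.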
Trying an example as a Condor job
The submit file for the matmul example is
submit_socket. In theory you could use
submit_pvm, but we don't have PVM installed. You could
also use submit_file, which uses Condor's standard
universe, but there is no particular advantage for our short-running
job.
Look at submit_socket:
# Now we're in the scheduler universe
universe = Scheduler
# The name of our executable
Executable = mastermatmul_socket
# Assume a max image size of 16 Megabytes.
Image_Size = 4 Meg
+MemoryRequirements = 4
# This goes into stdin for the master.
Input = in_master.socket
# Set the output of this job to go to out_master
Output = out_master.socket
# Set the stderr of this job to go to out_worker. It is named
# out_worker because the output of the workers is directed to stderr
Error = out_worker.socket
# Keep a log in case of problems.
Log = work.log
notify_user = chang@cs.wisc.edu
Queue
Notice two things about this submit file:
- Change the notify_user line to be correct for you.
- This is a scheduler universe job. We haven't talked about those very much. It's a job that runs on the submit computer as soon as you submit it. You get all the benefits of Condor (reliability, logging, etc.) with a job that executes locally. We use it for DAGMan and MW: it is a job that submits other jobs and watches over them. In this case, it will be the master, which will spawn the workers (as jobs) and send them their tasks.
Now submit the job and watch it run:
% condor_submit submit_socket
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.
% condor_q
-- Submitter: chopin.cs.wisc.edu : <128.105.121.21:50689> : chopin.cs.wisc.edu
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   roy     7/5 10:51   0+00:00:00 R  0   4.0  mastermatmul_socke
1 jobs; 0 idle, 1 running, 0 held
% condor_q
-- Submitter: chopin.cs.wisc.edu : <128.105.121.21:50689> : chopin.cs.wisc.edu
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   roy     7/5 10:51   0+00:00:01 R  0   4.0  mastermatmul_socke
   2.0   roy     7/5 10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.1   roy     7/5 10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.2   roy     7/5 10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.3   roy     7/5 10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.4   roy     7/5 10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.5   roy     7/5 10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
% condor_q
-- Submitter: chopin.cs.wisc.edu : <128.105.121.21:50689> : chopin.cs.wisc.edu
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   roy     7/5 10:51   0+00:00:26 R  0   4.0  mastermatmul_socke
   2.0   roy     7/5 10:51   0+00:00:03 R  0   0.0  mw_exec0.$$(Opsys)
   2.1   roy     7/5 10:51   0+00:00:01 R  0   0.0  mw_exec0.$$(Opsys)
   2.2   roy     7/5 10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.3   roy     7/5 10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.4   roy     7/5 10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
   2.5   roy     7/5 10:51   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
% condor_q
-- Submitter: chopin.cs.wisc.edu : <128.105.121.21:50689> : chopin.cs.wisc.edu
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
We saw the master submit six workers. Two of them started to run, and they did all of the work. Look at out_master.socket to see the result of the run:
% cat out_master.socket
10:51:19 MWDriver is pid 507.
10:51:19 Socket bound to port: 8997
10:51:19 Starting from the beginning.
10:51:19 argc=1, argv[0]=condor_scheduniv_exec.1.0
10:51:19 workermatmul_socket
10:51:19 tempnum_executables = 0
10:51:19 Making a link from workermatmul_socket to mw_exec0.LINUX.INTEL
10:51:19 In MWSocketRC::init_beginning_workers()
10:51:19 Good to go.
[output trimmed...]
If you look at the output carefully, you'll notice that only one worker did all of the tasks. That is because the time to do the tasks in this simple case was really short.
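One way to see why a single worker can end up with every task is a small model of the hand-out loop. The sketch below is a hypothetical simulation, not MW code. It assumes a master that always gives the next task to whichever worker is free earliest, that every task takes the same time, and that workers become available at the start times you pass in. MW's real scheduling may differ, and the function name tasks_per_worker is invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model, not MW code: worker w becomes available at
// start_times[w] seconds, and every task takes task_seconds to finish.
// The master always hands the next task to the worker that is free
// earliest (ties go to the lower-numbered worker).
// Returns how many tasks each worker ended up executing.
std::vector<int> tasks_per_worker(const std::vector<double>& start_times,
                                  int num_tasks, double task_seconds) {
    assert(!start_times.empty());
    std::vector<double> free_at = start_times;   // when each worker is next idle
    std::vector<int> done(start_times.size(), 0);
    for (int t = 0; t < num_tasks; ++t) {
        std::size_t best = 0;
        for (std::size_t w = 1; w < free_at.size(); ++w)
            if (free_at[w] < free_at[best]) best = w;
        free_at[best] += task_seconds;           // worker is busy for one task
        ++done[best];
    }
    return done;
}
```

With two workers that become available 2 seconds apart (an illustrative gap, not a measured one), tasks_per_worker({0.0, 2.0}, 10, 0.01) gives the first worker all 10 tasks, because it finishes everything before the second worker is ready. Raising the task time to 1 second, tasks_per_worker({0.0, 2.0}, 10, 1.0), splits the work 6 and 4.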
It's your turn
Now that you've tried out the basics, we'll let you explore by yourself. Here are some ideas:
- Make a task take a longer time (perhaps a sleep() to artificially inflate the time). Does MW use multiple workers properly?
- Write an MW program (or modify the matmul example) to have a task that does nothing.
- How long does it take to run 10,000 tasks?
- Estimate how long it would take to run 10,000 Condor jobs that do /bin/sleep 0.
RMC->set_target_num_workers( target_num_workers );
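For the two timing exercises above, a back-of-the-envelope model can help you decide what to measure. The sketch below is only arithmetic under stated assumptions: jobs run in waves of a fixed number of slots, and each job costs a fixed overhead plus its run time. The function name and all parameters are hypothetical, and the overhead value is a placeholder that you would need to measure on your own pool; nothing here is a Condor or MW measurement.

```cpp
#include <cassert>

// Rough estimate, not a measurement: jobs run in waves of `slots` at a
// time, and each job costs overhead_seconds (scheduling, startup,
// cleanup) plus run_seconds of real work.
double estimate_total_seconds(long num_jobs, long slots,
                              double overhead_seconds, double run_seconds) {
    assert(num_jobs >= 0 && slots > 0);
    long waves = (num_jobs + slots - 1) / slots;  // ceiling division
    return static_cast<double>(waves) * (overhead_seconds + run_seconds);
}
```

For example, if you measured roughly 5 seconds of overhead per job (an assumed figure) and had 2 slots, estimate_total_seconds(10000, 2, 5.0, 0.0) returns 25000 seconds. Comparing that against the time MW takes for 10,000 tiny tasks shows how much the per-job overhead dominates when the work itself is trivial.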