[SLL] multi-threading
Paul A. Franz, P.E.
paul at eucleides.com
Mon Sep 29 03:42:53 PDT 2008
On Sun, September 21, 2008 5:04 pm, Robert Woodcock wrote:
> On Sun, Sep 21, 2008 at 04:01:21PM -0700, Paul A. Franz, P.E. wrote:
>> I have a script that I'd like to speed up. The problem is that it queries
>> several hundred different hosts sequentially waiting for the response. I
>> would like to launch these requests in batches of say 6, or multiples of 6
>> all at once. I don't need the results of one query in order to do the next
>> one.
>>
>> Is there some simple way to launch multiple processes from within a bash
>> script?
I attempted to describe the problem in the simplest manner, but it probably would
have been better to give a more precise definition.
Given a sorted list of IPs that all originated spam during a period, I reverse the
quads and append the RBL name, then do a host lookup against each of the 6 RBLs, then
finally a reverse lookup on the IP. This repeats for every IP in the list, generating
output that looks like this:
Test each of the 236 unique hosts for inclusion with the current selection of 6 RBL's.
 __ not listed   __ bl.spamcop.net
/ __ listed     / __ dnsbl.sorbs.net
| /             | / __ no-more-funn.moensted.dk
| |             | | / __ zen.spamhaus.org
0 1             | | | / __ dnsbl.njabl.org
                | | | | / __ dnsbl-3.uceprotect.net
                | | | | | /
Check IP        | | | | | | Total   Reverse Lookup
24.20.246.139   1 1 1 1 0 0 -- 4 -- c-24-20-246-139.hsd1.or.comcast.net.
24.197.157.210  0 1 0 1 0 0 -- 2 -- 24-197-157-210.dhcp.gsvl.ga.charter.com.
41.201.118.192  1 0 0 1 0 1 -- 3 --
41.208.97.129   1 0 1 1 0 0 -- 3 --
58.8.105.72     0 1 1 1 0 1 -- 4 -- ppp-58-8-105-72.revip2.asianet.co.th.
58.8.160.29     0 1 1 1 0 1 -- 4 -- ppp-58-8-160-29.revip2.asianet.co.th.
58.8.166.252    1 1 1 1 0 1 -- 5 -- ppp-58-8-166-252.revip2.asianet.co.th.
58.9.224.179    1 1 1 1 0 1 -- 5 -- ppp-58-9-224-179.revip2.asianet.co.th.
58.64.52.115    0 0 0 1 0 1 -- 2 --
58.141.136.245  1 1 1 1 0 1 -- 5 --
60.26.108.101   1 0 1 1 0 0 -- 3 --
60.218.99.18    1 0 1 1 0 0 -- 3 --
61.106.93.131   1 0 1 1 0 1 -- 4 --
61.136.242.36   0 0 1 0 0 0 -- 1 --
61.158.157.244  1 1 1 1 0 0 -- 4 --
62.1.222.249    1 0 1 1 0 0 -- 3 -- ppp185-249.adsl.forthnet.gr.
.... and so on for all the IP's in the sorted list. This can take 15 minutes or more
for a list of 1000 IP's. I'd like to speed it up in a pseudo multi-threading method.
Just for drill here's the portion of code in the script that does that:
echo -e " Check IP | | | | | | Total Reverse Lookup"
for host_ip in `cat $TMPFILE2`
do
# debugging
#echo -e "$host_ip"
#exit 0
# what was the ip?
a=`host $host_ip` # host lookup on the IP passed
b="`echo $host_ip | cut -d. -f4`.`echo $host_ip | cut -d. -f3`.`echo $host_ip | cut -d. -f2`.`echo $host_ip | cut -d. -f1`." # reverse the quads into b
# echo -e "$host_ip $b" # print the IP and reversed quad notation (for debugging)
echo -n "$host_ip" # don't print \n after the IP
let "n=`echo $host_ip | wc | sed -e 's/.* //;q'`"
while [ "$n" -le "15" ] # pad the field with blanks until the 15th character
do
echo -n " "
let "n+=1"
done
n=0
j=0
while [ "$j" -lt "${#rbl[*]}" ]
do
if host `echo $b${rbl[$j]}` | grep -v found >> /dev/null # checks for success: something, but not "found" was found
then echo -n " 1" # print 1, IP is blacklisted, no \n
let "n+=1"
let "count[$j]+=1"
else echo -n " 0" # print 0, IP is not blacklisted, no \n
fi
let "j+=1"
done
# note for future analyses false positives for "0" value for Total is $n for IP $host_ip
# if [[ "$n" = "0" ]]; then # make n print red if it's zero
# n=`echo -ne "\033[0;31m$n\033[0;30m"`
# fi
echo -e " -- $n -- `echo $a | cut -d' ' -f5 | grep -v NXDOMAIN | grep -v alias | grep -v SERVFAIL | grep -v ^for$ `" # finally print the reverse lookup if it exists.
done
# generate column totals
echo -e "\n Column totals for each RBL, in order tested."
i=0
while [ "$i" -lt "${#count[*]}" ]
do
echo -n "${count[$i]}"; echo -e "\t${rbl[$i]}"
let "i+=1"
done
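Two parts of the loop above spawn a lot of helper processes per IP: the quad reversal (four echo | cut pipelines) and the padding loop. A minimal sketch of how both might be done with shell builtins follows. The function name reverse_quads and the variables q1..q4 are made up for illustration, and the width of 15 is an assumption read off the padding loop above.

```sh
#!/bin/sh
# reverse_quads is a hypothetical helper name, not part of the original script.
# It splits a dotted quad on "." with read and prints the quads in reverse order.
reverse_quads() {
    IFS=. read -r q1 q2 q3 q4 <<EOF
$1
EOF
    echo "$q4.$q3.$q2.$q1"
}

host_ip=24.20.246.139
rbl_name=zen.spamhaus.org

# Name to look up in the RBL: reversed quads, a dot, then the RBL zone.
query="$(reverse_quads "$host_ip").$rbl_name"
echo "$query"

# printf pads the IP to a fixed 15-character field, replacing the while loop.
printf '%-15s' "$host_ip"
echo " <- end of padded field"
```

The IFS assignment applies only to that one read, so the rest of the script keeps its normal word splitting.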
If you see something in the script that you think is poor practice, don't hesitate to
point it out. A couple of things in your script I didn't know would work. I didn't
know you could read in commands like you did and have them executed. I also hadn't
seen the clever infinite loop exiting with break you use. It appears "true" is a
command that always exits with status 0 (success), which is why it works.
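A quick way to check how the "while true ... break" idiom behaves: the shell tests the exit status of the command after "while", and 0 counts as success. This is only a small sketch; the variable names t, f and n are made up for illustration.

```sh
#!/bin/sh
# Capture the exit status of the true and false commands.
true
t=$?
false || f=$?
echo "true exits with $t, false exits with $f"

# An endless loop that leaves through break once a condition is met.
n=0
while true
do
    n=$((n + 1))
    if [ "$n" -ge 3 ]; then
        break
    fi
done
echo "loop body ran $n times"
```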
Your sample is very well done. I need a little more design help to incorporate your
scheme though.
I wonder if I could get clever using your scheme with "select"? Here's how it is
described in the bash manual.
select name [ in word ] ; do list ; done
The list of words following in is expanded, generating a list of
items. The set of expanded words is printed on the standard
error, each preceded by a number. If the in word is omitted,
the positional parameters are printed (see PARAMETERS below).
The PS3 prompt is then displayed and a line read from the standard
input. If the line consists of a number corresponding to
one of the displayed words, then the value of name is set to
that word. If the line is empty, the words and prompt are displayed
again. If EOF is read, the command completes. Any other
value read causes name to be set to null. The line read is
saved in the variable REPLY. The list is executed after each
selection until a break command is executed. The exit status of
select is the exit status of the last command executed in list,
or zero if no commands were executed.
> Any shell solution is going to revolve around the "&" shell operator which
> runs a task (or subshell) in the background instead of the foreground. I
> suppose another way would be to use a Makefile with make's -j option.
OK, that's a hint. I'll look that up and surmise a possible adaptation.
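As a first experiment with "&", here is a minimal sketch that backgrounds six independent tasks and then waits for all of them. slow_lookup is a hypothetical stand-in for a real "host" query (it just sleeps for a second), so the timing only illustrates the idea.

```sh
#!/bin/sh
# slow_lookup is a made-up placeholder for a DNS query that takes about 1 second.
slow_lookup() {
    sleep 1
    echo "looked up $1"
}

start=$(date +%s)
for n in 1 2 3 4 5 6
do
    slow_lookup "$n" > /dev/null &   # "&" returns immediately, the job runs in the background
done
wait                                 # block until every background job has finished
elapsed=$(( $(date +%s) - start ))
echo "six lookups took about ${elapsed}s"
```

Run one after another the six calls would need about six seconds; backgrounded they should finish in roughly the time of the slowest one.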
> A batch of 6, then another batch of 6, would be suboptimal because you'd
> often end up with a single straggler process that has to finish before the
> next batch gets kicked off.
Retaining order is going to take some cleverness. I can't think of any way to do it
other than filling a 2 dimensional array, rather than printing the results
sequentially as I do now.
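One possible way to keep the order without a 2 dimensional array, sketched under the assumption that each background job can write its finished output line to its own file: name the files by zero-padded input position, then concatenate them once everything is done. The sleep times and the "result" text below are invented so the jobs finish out of order, and the addresses are documentation placeholders.

```sh
#!/bin/sh
tmpdir=$(mktemp -d)
i=0
for ip in 192.0.2.3 192.0.2.2 192.0.2.1
do
    i=$((i + 1))
    (
        # Later jobs sleep less, so they finish before the earlier ones.
        sleep $((3 - i))
        echo "$ip result" > "$tmpdir/$(printf '%05d' "$i")"
    ) &
done
wait

# The glob expands in sorted order, so zero-padded names give input order.
result=$(cat "$tmpdir"/*)
rm -rf "$tmpdir"
echo "$result"
```

Because each job owns its file, nothing is interleaved, and the final cat prints the lines in the order the IPs were read.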
>
> My solution would be to run an increasing number of processes in the
> background and have each one append to a temp file to let you know it's done
> (which I think gets us atomicity since we don't care about ordering). Before
> each one is started, check to see how many tasks are finished, and if
> (Started - Finished) > MaxProcs, then sleep for a second and re-check:
>
> #!/bin/sh
> MAXPROCS=6
> STARTED=0
> COUNTER=$(mktemp /tmp/counter.XXXXXX)
> # Read list of queries from stdin
> while read QUERY
> do
> while true
> do
> FINISHED=$(wc -l < $COUNTER)
> RUNNING=$(expr $STARTED - $FINISHED)
> if [ $RUNNING -lt $MAXPROCS ]; then
> break
> fi
> sleep 1
> done
> STARTED=$(expr $STARTED + 1)
> (
> $QUERY
> echo >> $COUNTER
> ) &
> done
>
> # Wait for all backgrounded tasks to finish before exiting
> while [ $FINISHED -lt $STARTED ]
> do
> sleep 1
> FINISHED=$(wc -l < $COUNTER)
> done
>
> # Clean up
> rm -f $COUNTER
>
> The script takes a list of commands to run on stdin, runs a maximum of 6 at
> a time, and doesn't return until they've all finished.
>
> If, for example, all of your queries are HTTP queries using wget, you could
> replace the "$QUERY" line with "wget $QUERY" and pass the script a list of
> URLs on stdin. You could also have the script read from a file by putting
> the main while/do/done loop in a subshell and piping a file to it:
>
> [...]
> cat queryfile | (
> while read QUERY
> do
> [...]
> done
> )
> [...]
> --
> Robert Woodcock - rcw at blarg.net
> "Duct tape: The last refuge of the incompetent... because the competent
> don't leave it for last."
> -- seen on slashdot
>
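For comparison, a different technique from the counter-file script quoted above: GNU and BSD xargs accept a -P option (not part of POSIX) that caps the number of simultaneous processes. This is only a sketch. The echo stands in for the real per-IP check, the addresses are documentation placeholders, and the output order is not guaranteed, so it is sorted here.

```sh
#!/bin/sh
# Feed one IP per line to xargs; -n 1 passes one IP per command,
# -P 6 keeps at most six commands running at once.
out=$(printf '%s\n' 192.0.2.1 192.0.2.2 192.0.2.3 |
      xargs -n 1 -P 6 sh -c 'echo "checked $1"' _ |
      sort)
echo "$out"
```

xargs starts a new command as soon as one finishes, so there is no straggler problem between batches.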
--
Paul A. Franz, P.E.
PAF Consulting Engineers
Office 425.440.9505
Cell 425.241.1618