[SLL] multi-threading
Bill Warner
billw at onedrous.org
Mon Sep 29 08:22:24 PDT 2008
Paul,
You could also partition your data and distribute each partition to a
separate process that writes to a separate file. When all are complete,
recombine the output files.
It's not as elegant or general as Robert's solution, but it might require
less refactoring of your current script.
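A rough sketch of what I mean, assuming bash and one IP per line in the
input file. The names check_one, process_chunk and run_partitioned are made
up for illustration, and check_one is only a placeholder for the real
lookups:

```shell
#!/bin/bash
# Sketch: split a list into chunks, process each chunk in its own background
# process writing to its own file, then recombine in the original order.

check_one() {
    # Hypothetical placeholder: put the real RBL and reverse lookups here.
    echo "checked $1"
}

process_chunk() {
    # Process every line of chunk file $1, writing results to $1.out.
    while read -r ip || [ -n "$ip" ]; do
        check_one "$ip"
    done < "$1" > "$1.out"
}

run_partitioned() {
    # $1 = input list (one item per line), $2 = number of partitions.
    local input=$1 parts=$2 workdir total per chunk
    [ -s "$input" ] || return 0               # nothing to do for empty input
    workdir=$(mktemp -d) || return 1
    total=$(wc -l < "$input")
    per=$(( (total + parts - 1) / parts ))    # lines per chunk, rounded up
    [ "$per" -gt 0 ] || per=1
    split -l "$per" "$input" "$workdir/chunk."
    for chunk in "$workdir"/chunk.*; do       # glob is expanded once, up front
        process_chunk "$chunk" &
    done
    wait                                      # block until every chunk is done
    cat "$workdir"/chunk.*.out                # names sort, so order is kept
    rm -rf "$workdir"
}
```

Something like `run_partitioned "$TMPFILE2" 6 > results.txt` would then run
about six chunks at once. Because the chunk outputs are concatenated in name
order, the result keeps the order of the sorted input list. Counters such as
the per-RBL totals live in each child process, so they would have to be
computed from the combined output afterwards.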
b
Paul A. Franz, P.E. wrote:
> On Sun, September 21, 2008 5:04 pm, Robert Woodcock wrote:
>
>> On Sun, Sep 21, 2008 at 04:01:21PM -0700, Paul A. Franz, P.E. wrote:
>>
>>> I have a script that I'd like to speed up. The problem is that it queries
>>> several hundred different hosts sequentially waiting for the response. I
>>> would like to launch these requests in batches of say 6, or multiples of 6
>>> all at once. I don't need the results of one query in order to do the next
>>> one.
>>>
>>> Is there some simple way to launch multiple processes from within a bash
>>> script?
>>>
>
> I attempted to describe the problem in the simplest manner but probably would have
> been better to give a more precise definition.
>
> Given a sorted list of IP's that all originated spams during a period, I reverse the
> quads and append the rbl name then do host look ups for each of the 6 rbl's then
> finally a reverse lookup on the IP then repeat for all the IP's in the list generating
> output that looks like this:
>
> Test each of the 236 unique hosts for inclusion with the current selection of 6 RBL's.
>
> __ not listed __ bl.spamcop.net
> / __ listed / __ dnsbl.sorbs.net
> | / | / __ no-more-funn.moensted.dk
> | | | | / __ zen.spamhaus.org
> 0 1 | | | / __ dnsbl.njabl.org
> | | | | / __ dnsbl-3.uceprotect.net
> | | | | | /
> Check IP | | | | | | Total Reverse Lookup
> 24.20.246.139 1 1 1 1 0 0 -- 4 -- c-24-20-246-139.hsd1.or.comcast.net.
> 24.197.157.210 0 1 0 1 0 0 -- 2 -- 24-197-157-210.dhcp.gsvl.ga.charter.com.
> 41.201.118.192 1 0 0 1 0 1 -- 3 --
> 41.208.97.129 1 0 1 1 0 0 -- 3 --
> 58.8.105.72 0 1 1 1 0 1 -- 4 -- ppp-58-8-105-72.revip2.asianet.co.th.
> 58.8.160.29 0 1 1 1 0 1 -- 4 -- ppp-58-8-160-29.revip2.asianet.co.th.
> 58.8.166.252 1 1 1 1 0 1 -- 5 -- ppp-58-8-166-252.revip2.asianet.co.th.
> 58.9.224.179 1 1 1 1 0 1 -- 5 -- ppp-58-9-224-179.revip2.asianet.co.th.
> 58.64.52.115 0 0 0 1 0 1 -- 2 --
> 58.141.136.245 1 1 1 1 0 1 -- 5 --
> 60.26.108.101 1 0 1 1 0 0 -- 3 --
> 60.218.99.18 1 0 1 1 0 0 -- 3 --
> 61.106.93.131 1 0 1 1 0 1 -- 4 --
> 61.136.242.36 0 0 1 0 0 0 -- 1 --
> 61.158.157.244 1 1 1 1 0 0 -- 4 --
> 62.1.222.249 1 0 1 1 0 0 -- 3 -- ppp185-249.adsl.forthnet.gr.
> .... and so on for all the IP's in the sorted list. This can take 15 minutes or more
> for a list of 1000 IP's. I'd like to speed it up in a pseudo multi-threading method.
>
> Just for drill here's the portion of code in the script that does that:
>
> echo -e " Check IP | | | | | | Total Reverse Lookup"
> for host_ip in `cat $TMPFILE2`
> do
> # debugging
> #echo -e "$host_ip"
> #exit 0
> # what was the ip?
> a=`host $host_ip` # host lookup on the IP passed
> b="`echo $host_ip | cut -d. -f4`.`echo $host_ip | cut -d. -f3`.`echo $host_ip | cut -d. -f2`.`echo $host_ip | cut -d. -f1`." # reverse the quads into b
> # echo -e "$host_ip $b" # print the IP and reversed quad notation (for debugging)
> echo -n "$host_ip" # don't print \n after the IP
> let "n=`echo $host_ip | wc | sed -e 's/.* //;q'`"
> while [ "$n" -le "15" ] # pad the field with blanks until the 15th character
> do
> echo -n " "
> let "n+=1"
> done
> n=0
> j=0
> while [ "$j" -lt "${#rbl[*]}" ]
> do
> if host `echo $b${rbl[$j]}` | grep -v found >> /dev/null # checks for success: something, but not "found" was found
> then echo -n " 1" # print 1, IP is blacklisted, no \n
> let "n+=1"
> let "count[$j]+=1"
> else echo -n " 0" # print 0, IP is not blacklisted, no \n
> fi
> let "j+=1"
> done
> # note for future analyses false positives for "0" value for Total is $n for IP $host_ip
> # if [[ "$n" = "0" ]]; then # make n print red if it's zero
> # n=`echo -ne "\033[0;31m$n\033[0;30m"`
> # fi
> echo -e " -- $n -- `echo $a | cut -d\ -f5 | grep -v NXDOMAIN | grep -v alias | grep -v SERVFAIL | grep -v ^for$ `" # finally print the reverse lookup if it exists.
> done
> # generate column totals
> echo -e "\n Column totals for each RBL, in order tested."
> i=0
> while [ "$i" -lt "${#count[*]}" ]
> do
> echo -n "${count[$i]}"; echo -e "\t${rbl[$i]}"
> let "i+=1"
> done
>
> If you see something in the script that you think is poor practice, don't hesitate to
> point it out. A couple of things in your script I didn't know would work. I didn't
> know you could read in commands like you did and have them executed. I also hadn't
> seen the clever infinite loop exiting with break you use. It appears "true" starts
> with a null value which evaluates to 'not false', is why it works.
>
> Your sample is very well done. I need a little more design help to incorporate your
> scheme though.
>
> I wonder if I could get clever using your scheme with "select"? Here's how it is
> described in the bash manual.
>
>
> select name [ in word ] ; do list ; done
> The list of words following in is expanded, generating a list of
> items. The set of expanded words is printed on the standard
> error, each preceded by a number. If the in word is omitted,
> the positional parameters are printed (see PARAMETERS below).
> The PS3 prompt is then displayed and a line read from the
> standard input. If the line consists of a number corresponding to
> one of the displayed words, then the value of name is set to
> that word. If the line is empty, the words and prompt are
> displayed again. If EOF is read, the command completes. Any other
> value read causes name to be set to null. The line read is
> saved in the variable REPLY. The list is executed after each
> selection until a break command is executed. The exit status of
> select is the exit status of the last command executed in list,
> or zero if no commands were executed.
>
>
>> Any shell solution is going to revolve around the "&" shell operator which
>> runs a task (or subshell) in the background instead of the foreground. I
>> suppose another way would be to use a Makefile with make's -j option.
>>
>
> OK, that's a hint. I'll look that up and surmise a possible adaptation.
>
>
>> A batch of 6, then another batch of 6, would be suboptimal because you'd
>> often end up with a single straggler process that has to finish before the
>> next batch gets kicked off.
>>
>
> Retaining order is going to take some cleverness. I can't think of any way to do it
> other than filling a 2 dimensional array, rather than printing the results
> sequentially as I do now.
>
>
>> My solution would be to run an increasing number of processes in the
>> background and have each one append to a temp file to let you know it's done
>> (which I think gets us atomicity since we don't care about ordering). Before
>> each one is started, check to see how many tasks are finished, and if
>> (Started - Finished) > MaxProcs, then sleep for a second and re-check:
>>
>> #!/bin/sh
>> MAXPROCS=6
>> STARTED=0
>> FINISHED=0
>> COUNTER=$(mktemp /tmp/counter.XXXXXX)
>> # Read list of queries from stdin
>> while read QUERY
>> do
>> while true
>> do
>> FINISHED=$(wc -l < $COUNTER)
>> RUNNING=$(expr $STARTED - $FINISHED)
>> if [ $RUNNING -lt $MAXPROCS ]; then
>> break
>> fi
>> sleep 1
>> done
>> STARTED=$(expr $STARTED + 1)
>> (
>> $QUERY
>> echo >> $COUNTER
>> ) &
>> done
>>
>> # Wait for all backgrounded tasks to finish before exiting
>> while [ $FINISHED -lt $STARTED ]
>> do
>> sleep 1
>> FINISHED=$(wc -l < $COUNTER)
>> done
>>
>> # Clean up
>> rm -f $COUNTER
>>
>> The script takes a list of commands to run on stdin, runs a maximum of 6 at
>> a time, and doesn't return until they've all finished.
>>
>> If, for example, all of your queries are HTTP queries using wget, you could
>> replace the "$QUERY" line with "wget $QUERY" and pass the script a list of
>> URLs on stdin. You could also have the script read from a file by putting
>> the main while/do/done loop in a subshell and piping a file to it:
>>
>> [...]
>> cat queryfile | (
>> while read QUERY
>> do
>> [...]
>> done
>> )
>> [...]
>> --
>> Robert Woodcock - rcw at blarg.net
>> "Duct tape: The last refuge of the incompetent... because the competent
>> don't leave it for last."
>> -- seen on slashdot
>>
>>
>
>
>
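On the quoted request for comments on the script itself: one possible
simplification, assuming bash, is to reverse the quads with a single read
and to pad the IP column with printf, instead of the four cut pipelines and
the padding loop. The function names below are hypothetical:

```shell
#!/bin/bash
# reverse_quads and pad_ip are hypothetical helper names.

reverse_quads() {
    # Split "a.b.c.d" on dots and print "d.c.b.a." (trailing dot included,
    # ready to have an RBL zone name appended).
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo "$d.$c.$b.$a."
}

pad_ip() {
    # Print the IP left-justified in a 15-character field, no newline.
    printf '%-15s' "$1"
}
```

The lookup inside the loop could then read
`host "$(reverse_quads "$host_ip")${rbl[$j]}"`, which avoids starting eight
extra processes per IP just to rearrange the address.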