[SLL] multi-threading

Bill Warner billw at onedrous.org
Mon Sep 29 08:22:24 PDT 2008


Paul,

You could also partition your data and distribute each partition to a
separate process that writes to a separate file. When all are complete,
recombine the output files.

It's not as elegant or general as Robert's solution, but it might require
less refactoring of your current script.
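
A rough sketch of that approach, in case it helps. It assumes bash plus the
standard split, wc and mktemp utilities. The names lookup_one and
run_partitioned are hypothetical, and lookup_one only stands in for the
body of your per-IP loop. Each partition runs in its own background process
and writes its own output file; the files are then concatenated in partition
order, so the sorted order of the list should be preserved.

```shell
#!/bin/bash
# Sketch only. lookup_one is a hypothetical placeholder for the real
# per-IP work (the RBL queries and the reverse lookup).
lookup_one() {
    echo "$1 checked"
}

# Split the input list into roughly nparts pieces, process each piece in
# its own background process writing to its own file, then recombine.
run_partitioned() {
    local infile=$1 nparts=$2
    local workdir total per part
    [ -s "$infile" ] || return 0               # nothing to do for an empty list
    workdir=$(mktemp -d /tmp/parts.XXXXXX) || return 1
    total=$(wc -l < "$infile")
    per=$(( (total + nparts - 1) / nparts ))   # lines per partition, rounded up
    [ "$per" -gt 0 ] || per=1
    split -l "$per" "$infile" "$workdir/in."   # creates in.aa, in.ab, ...

    for part in "$workdir"/in.*; do
        (
            while read -r ip; do
                lookup_one "$ip"
            done < "$part" > "$workdir/out.${part##*.}"
        ) &
    done
    wait                    # block until every background partition is done

    cat "$workdir"/out.*    # the glob expands in sorted order: out.aa, out.ab, ...
    rm -rf "$workdir"
}
```

With the real lookup code in place of the placeholder, it could be called
as: run_partitioned "$TMPFILE2" 6. One caveat: a background process cannot
change variables in the parent shell, so the count[] column totals would
have to be derived from the combined output rather than incremented inside
the loop.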

 b

Paul A. Franz, P.E. wrote:
> On Sun, September 21, 2008 5:04 pm, Robert Woodcock wrote:
>   
>> On Sun, Sep 21, 2008 at 04:01:21PM -0700, Paul A. Franz, P.E. wrote:
>>     
>>> I have a script that I'd like to speed up. The problem is that it queries
>>> several hundred different hosts sequentially waiting for the response. I
>>> would like to launch these requests in batches of say 6, or multiples of 6
>>> all at once. I don't need the results of one query in order to do the next
>>> one.
>>>
>>> Is there some simple way to launch multiple processes from within a bash
>>> script?
>>>       
>
> I attempted to describe the problem in the simplest manner, but it probably would
> have been better to give a more precise definition.
>
> Given a sorted list of IPs that all originated spam during a period, I reverse the
> quads and append the RBL name, do host lookups against each of the 6 RBLs, and
> finally do a reverse lookup on the IP. I repeat this for every IP in the list,
> generating output that looks like this:
>
> Test each of the 236 unique hosts for inclusion with the current selection of 6 RBL's.
>
>    __ not listed   __ bl.spamcop.net
>  /   __ listed   /   __ dnsbl.sorbs.net
> |  /            |  /   __ no-more-funn.moensted.dk
> | |             | |  /   __ zen.spamhaus.org
> 0 1             | | |  /   __ dnsbl.njabl.org
>                 | | | |  /   __  dnsbl-3.uceprotect.net
>                 | | | | |  /
>   Check IP      | | | | | |  Total  Reverse Lookup
> 24.20.246.139   1 1 1 1 0 0 -- 4 -- c-24-20-246-139.hsd1.or.comcast.net.
> 24.197.157.210  0 1 0 1 0 0 -- 2 -- 24-197-157-210.dhcp.gsvl.ga.charter.com.
> 41.201.118.192  1 0 0 1 0 1 -- 3 --
> 41.208.97.129   1 0 1 1 0 0 -- 3 --
> 58.8.105.72     0 1 1 1 0 1 -- 4 -- ppp-58-8-105-72.revip2.asianet.co.th.
> 58.8.160.29     0 1 1 1 0 1 -- 4 -- ppp-58-8-160-29.revip2.asianet.co.th.
> 58.8.166.252    1 1 1 1 0 1 -- 5 -- ppp-58-8-166-252.revip2.asianet.co.th.
> 58.9.224.179    1 1 1 1 0 1 -- 5 -- ppp-58-9-224-179.revip2.asianet.co.th.
> 58.64.52.115    0 0 0 1 0 1 -- 2 --
> 58.141.136.245  1 1 1 1 0 1 -- 5 --
> 60.26.108.101   1 0 1 1 0 0 -- 3 --
> 60.218.99.18    1 0 1 1 0 0 -- 3 --
> 61.106.93.131   1 0 1 1 0 1 -- 4 --
> 61.136.242.36   0 0 1 0 0 0 -- 1 --
> 61.158.157.244  1 1 1 1 0 0 -- 4 --
> 62.1.222.249    1 0 1 1 0 0 -- 3 -- ppp185-249.adsl.forthnet.gr.
> .... and so on for all the IP's in the sorted list. This can take 15 minutes or more
> for a list of 1000 IP's. I'd like to speed it up in a pseudo multi-threading method.
>
> Just for drill here's the portion of code in the script that does that:
>
> echo -e "  Check IP      | | | | | |  Total  Reverse Lookup"
> for host_ip in `cat $TMPFILE2`
>   do
> # debugging
> #echo -e "$host_ip"
> #exit 0
> # what was the ip?
>     a=`host $host_ip` # host lookup on the IP passed
>     b="`echo $host_ip | cut -d. -f4`.`echo $host_ip | cut -d. -f3`.`echo $host_ip | cut -d. -f2`.`echo $host_ip | cut -d. -f1`." # reverse the quads into b
> #    echo -e "$host_ip $b" # print the IP and reversed quad notation (for debugging)
>     echo -n "$host_ip" # don't print \n after the IP
>     let "n=`echo $host_ip | wc | sed -e 's/.* //;q'`"
>     while [ "$n" -le "15" ] # pad the field with blanks until the 15th character
>       do
>         echo -n " "
>         let "n+=1"
>       done
> n=0
> j=0
> while [ "$j" -lt "${#rbl[*]}" ]
>     do
>       if host `echo $b${rbl[$j]}` | grep -v found >> /dev/null # checks for success: something, but not "found" was found
>         then echo -n " 1" # print 1, IP is blacklisted, no \n
>           let "n+=1"
>           let "count[$j]+=1"
>         else echo -n " 0" # print 0, IP is not blacklisted, no \n
>       fi
>       let "j+=1"
>     done
> # note for future analyses false positives for "0" value for Total is $n for IP $host_ip
> #      if [[ "$n" = "0" ]]; then # make n print red if it's zero
> #        n=`echo -ne "\033[0;31m$n\033[0;30m"`
> #      fi
>     echo -e " -- $n -- `echo $a | cut -d\  -f5 | grep -v NXDOMAIN | grep -v alias | grep -v SERVFAIL\
>     | grep -v ^for$ `" # finally print the reverse lookup if it exists.
>   done
> # generate column totals
> echo -e "\n Column totals for each RBL, in order tested."
> i=0
> while [ "$i" -lt "${#count[*]}" ]
>    do
>      echo -n "${count[$i]}"; echo -e "\t${rbl[$i]}"
>      let "i+=1"
>    done
>
> If you see something in the script that you think is poor practice, don't hesitate to
> point it out. A couple of things in your script I didn't know would work. I didn't
> know you could read in commands like you did and have them executed. I also hadn't
> seen the clever infinite loop exiting with break you use. It appears "true" starts
> with a null value which evaluates to 'not false', is why it works.
>
> Your sample is very well done. I need a little more design help to incorporate your
> scheme though.
>
> I wonder if I could get clever using your scheme with "select"? Here's how it is
> described in the bash manual.
>
>
>        select name [ in word ] ; do list ; done
>               The list of words following in is expanded, generating a list of
>               items.  The set of expanded words is  printed  on  the  standard
>               error,  each  preceded  by a number.  If the in word is omitted,
>               the positional parameters are printed  (see  PARAMETERS  below).
>               The  PS3 prompt is then displayed and a line read from the stan-
>               dard input.  If the line consists of a number  corresponding  to
>               one  of  the  displayed  words, then the value of name is set to
>               that word.  If the line is empty, the words and prompt are  dis-
>               played again.  If EOF is read, the command completes.  Any other
>               value read causes name to be set to  null.   The  line  read  is
>               saved  in  the  variable REPLY.  The list is executed after each
>               selection until a break command is executed.  The exit status of
>               select  is the exit status of the last command executed in list,
>               or zero if no commands were executed.
>
>   
>> Any shell solution is going to revolve around the "&" shell operator which
>> runs a task (or subshell) in the background instead of the foreground. I
>> suppose another way would be to use a Makefile with make's -j option.
>>     
>
> OK, that's a hint. I'll look that up and surmise a possible adaptation.
>
>   
>> A batch of 6, then another batch of 6, would be suboptimal because you'd
>> often end up with a single straggler process that has to finish before the
>> next batch gets kicked off.
>>     
>
> Retaining order is going to take some cleverness. I can't think of any way to do it
> other than filling a 2 dimensional array, rather than printing the results
> sequentially as I do now.
>
>   
>> My solution would be to run an increasing number of processes in the
>> background and have each one append to a temp file to let you know it's done
>> (which I think gets us atomicity since we don't care about ordering). Before
>> each one is started, check to see how many tasks are finished, and if
>> (Started - Finished) > MaxProcs, then sleep for a second and re-check:
>>
>> #!/bin/sh
>> MAXPROCS=6
>> STARTED=0
>> COUNTER=$(mktemp /tmp/counter.XXXXXX)
>> # Read list of queries from stdin
>> while read QUERY
>> do
>>   while true
>>   do
>>     FINISHED=$(wc -l < $COUNTER)
>>     RUNNING=$(expr $STARTED - $FINISHED)
>>     if [ $RUNNING -lt $MAXPROCS ]; then
>>       break
>>     fi
>>     sleep 1
>>   done
>>   STARTED=$(expr $STARTED + 1)
>>   (
>>     $QUERY
>>     echo >> $COUNTER
>>   ) &
>> done
>>
>> # Wait for all backgrounded tasks to finish before exiting
>> while [ $FINISHED -lt $STARTED ]
>> do
>>   sleep 1
>>   FINISHED=$(wc -l < $COUNTER)
>> done
>>
>> # Clean up
>> rm -f $COUNTER
>>
>> The script takes a list of commands to run on stdin, runs a maximum of 6 at
>> a time, and doesn't return until they've all finished.
>>
>> If, for example, all of your queries are HTTP queries using wget, you could
>> replace the "$QUERY" line with "wget $QUERY" and pass the script a list of
>> URLs on stdin. You could also have the script read from a file by putting
>> the main while/do/done loop in a subshell and piping a file to it:
>>
>> [...]
>> cat queryfile | (
>> while read QUERY
>> do
>>   [...]
>> done
>> )
>> [...]
>> --
>> Robert Woodcock - rcw at blarg.net
>> "Duct tape: The last refuge of the incompetent... because the competent
>> don't leave it for last."
>> 	-- seen on slashdot
>>
>>     
>
>
>   

