[SLL] multi-threading

Paul A. Franz, P.E. paul at eucleides.com
Mon Sep 29 03:42:53 PDT 2008


On Sun, September 21, 2008 5:04 pm, Robert Woodcock wrote:
> On Sun, Sep 21, 2008 at 04:01:21PM -0700, Paul A. Franz, P.E. wrote:
>> I have a script that I'd like to speed up. The problem is that it queries
>> several hundred different hosts sequentially waiting for the response. I
>> would like to launch these requests in batches of say 6, or multiples of 6
>> all at once. I don't need the results of one query in order to do the next
>> one.
>>
>> Is there some simple way to launch multiple processes from within a bash
>> script?

I tried to describe the problem as simply as possible, but it probably would have
been better to give a more precise definition.

Given a sorted list of IPs that all originated spam during a period, I reverse the
quads of each IP, append the RBL name, and do a host lookup against each of the 6
RBLs, then finally do a reverse lookup on the IP itself. I repeat that for every IP
in the list, generating output that looks like this:

Test each of the 236 unique hosts for inclusion with the current selection of 6 RBL's.

   __ not listed   __ bl.spamcop.net
 /   __ listed   /   __ dnsbl.sorbs.net
|  /            |  /   __ no-more-funn.moensted.dk
| |             | |  /   __ zen.spamhaus.org
0 1             | | |  /   __ dnsbl.njabl.org
                | | | |  /   __  dnsbl-3.uceprotect.net
                | | | | |  /
  Check IP      | | | | | |  Total  Reverse Lookup
24.20.246.139   1 1 1 1 0 0 -- 4 -- c-24-20-246-139.hsd1.or.comcast.net.
24.197.157.210  0 1 0 1 0 0 -- 2 -- 24-197-157-210.dhcp.gsvl.ga.charter.com.
41.201.118.192  1 0 0 1 0 1 -- 3 --
41.208.97.129   1 0 1 1 0 0 -- 3 --
58.8.105.72     0 1 1 1 0 1 -- 4 -- ppp-58-8-105-72.revip2.asianet.co.th.
58.8.160.29     0 1 1 1 0 1 -- 4 -- ppp-58-8-160-29.revip2.asianet.co.th.
58.8.166.252    1 1 1 1 0 1 -- 5 -- ppp-58-8-166-252.revip2.asianet.co.th.
58.9.224.179    1 1 1 1 0 1 -- 5 -- ppp-58-9-224-179.revip2.asianet.co.th.
58.64.52.115    0 0 0 1 0 1 -- 2 --
58.141.136.245  1 1 1 1 0 1 -- 5 --
60.26.108.101   1 0 1 1 0 0 -- 3 --
60.218.99.18    1 0 1 1 0 0 -- 3 --
61.106.93.131   1 0 1 1 0 1 -- 4 --
61.136.242.36   0 0 1 0 0 0 -- 1 --
61.158.157.244  1 1 1 1 0 0 -- 4 --
62.1.222.249    1 0 1 1 0 0 -- 3 -- ppp185-249.adsl.forthnet.gr.
.... and so on for all the IPs in the sorted list. This can take 15 minutes or more
for a list of 1000 IPs. I'd like to speed it up with some kind of pseudo
multi-threading.
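
To make the "reverse the quads and append the RBL name" step concrete, here is a
minimal sketch, not the exact code from my script. The function name reverse_quads
and the two-zone list are only placeholders for illustration; the real script loops
over all 6 RBLs and passes each name to host.

```shell
#!/bin/bash
# Sketch only: reverse_quads is a hypothetical helper, not taken from my script.
reverse_quads() {
  # Split the dotted quad on "." by setting IFS locally, then emit it reversed.
  local IFS=.
  set -- $1
  echo "$4.$3.$2.$1"
}

ip=24.20.246.139
for zone in bl.spamcop.net zen.spamhaus.org; do
  # This is the name that would be handed to "host"; no DNS query is made here.
  echo "$(reverse_quads "$ip").$zone"
done
```

Doing the split inside the shell avoids the four separate echo and cut pipelines per
IP, which may matter a little when the list runs to 1000 entries.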

Just for drill, here's the portion of code in the script that does that:

echo -e "  Check IP      | | | | | |  Total  Reverse Lookup"
for host_ip in `cat $TMPFILE2`
  do
# debugging
#echo -e "$host_ip"
#exit 0
# what was the ip?
    a=`host $host_ip` # host lookup on the IP passed
    b="`echo $host_ip | cut -d. -f4`.`echo $host_ip | cut -d. -f3`.`echo $host_ip | cut -d. -f2`.`echo $host_ip | cut -d. -f1`." # reverse the quads into b
#    echo -e "$host_ip $b" # print the IP and reversed quad notation (for debugging)
    echo -n "$host_ip" # don't print \n after the IP
    let "n=`echo $host_ip | wc | sed -e 's/.* //;q'`"
    while [ "$n" -le "15" ] # pad the field with blanks until the 15th character
      do
        echo -n " "
        let "n+=1"
      done
n=0
j=0
while [ "$j" -lt "${#rbl[*]}" ]
    do
      if host `echo $b${rbl[$j]}` | grep -v found >> /dev/null # checks for success: something, but not "found" was found
        then echo -n " 1" # print 1, IP is blacklisted, no \n
          let "n+=1"
          let "count[$j]+=1"
        else echo -n " 0" # print 0, IP is not blacklisted, no \n
      fi
      let "j+=1"
    done
# note for future analyses false positives for "0" value for Total is $n for IP $host_ip
#      if [[ "$n" = "0" ]]; then # make n print red if it's zero
#        n=`echo -ne "\033[0;31m$n\033[0;30m"`
#      fi
    echo -e " -- $n -- `echo $a | cut -d\  -f5 | grep -v NXDOMAIN | grep -v alias | grep -v SERVFAIL\
    | grep -v ^for$ `" # finally print the reverse lookup if it exists.
  done
# generate column totals
echo -e "\n Column totals for each RBL, in order tested."
i=0
while [ "$i" -lt "${#count[*]}" ]
   do
     echo -n "${count[$i]}"; echo -e "\t${rbl[$i]}"
     let "i+=1"
   done
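
One spot I suspect could be simpler is the padding loop. Here is a minimal sketch,
assuming printf's %-15s left-justified field does the same job as my while loop.
The name print_row and its argument order are made up for illustration and are not
in the script above.

```shell
#!/bin/bash
# print_row is a hypothetical helper: IP, total, reverse name, then one flag per RBL.
print_row() {
  local ip=$1 total=$2 rdns=$3
  shift 3
  printf '%-15s' "$ip"            # left-justify the IP in a 15-character field
  local flag
  for flag in "$@"; do
    printf ' %s' "$flag"          # one " 0" or " 1" per RBL
  done
  printf ' -- %s -- %s\n' "$total" "$rdns"
}

print_row 24.20.246.139 4 c-24-20-246-139.hsd1.or.comcast.net. 1 1 1 1 0 0
```

That call should reproduce the first row of the sample output shown earlier.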

If you see something in the script that you think is poor practice, don't hesitate to
point it out. A couple of things in your script I didn't know would work. I didn't
know you could read in commands like you did and have them executed. I also hadn't
seen the clever infinite loop that exits with break. It appears "true" is a command
that always returns exit status 0 (success), which is why the loop keeps going until
the break.
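
A small sketch of that loop pattern, separate from your script, to convince myself
of how it behaves. The counter and the limit of 3 are arbitrary.

```shell
#!/bin/bash
true
echo "exit status of true: $?"   # true is a command that always exits with status 0

count=0
while true                       # the condition never fails, so only break ends the loop
do
  count=$((count + 1))
  if [ "$count" -ge 3 ]; then
    break
  fi
done
echo "loop ran $count times"
```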

Your sample is very well done. I need a little more design help to incorporate your
scheme though.

I wonder if I could get clever using your scheme with "select"? Here's how it is
described in the bash manual.


       select name [ in word ] ; do list ; done
              The list of words following in is expanded, generating a list of
              items.  The set of expanded words is  printed  on  the  standard
              error,  each  preceded  by a number.  If the in word is omitted,
              the positional parameters are printed  (see  PARAMETERS  below).
              The  PS3 prompt is then displayed and a line read from the stan-
              dard input.  If the line consists of a number  corresponding  to
              one  of  the  displayed  words, then the value of name is set to
              that word.  If the line is empty, the words and prompt are  dis-
              played again.  If EOF is read, the command completes.  Any other
              value read causes name to be set to  null.   The  line  read  is
              saved  in  the  variable REPLY.  The list is executed after each
              selection until a break command is executed.  The exit status of
              select  is the exit status of the last command executed in list,
              or zero if no commands were executed.
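
To see what that description means in practice, here is a minimal sketch. The
function pick_zone and the zone list are hypothetical, and the choice is fed in
through a pipe only so the example runs unattended. As far as I can tell, select is
an interactive menu and does nothing to run things in parallel, so it may not help
here.

```shell
#!/bin/bash
# pick_zone is a made-up name; it shows a numbered menu on stderr and reads a choice on stdin.
pick_zone() {
  PS3="Which RBL? "
  select zone in bl.spamcop.net dnsbl.sorbs.net zen.spamhaus.org
  do
    echo "$zone"   # the word matching the number that was entered
    break          # without break, select would prompt again
  done
}

# Feed "2" on stdin instead of typing it; the menu and prompt go to stderr.
printf '2\n' | pick_zone 2>/dev/null
```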

> Any shell solution is going to revolve around the "&" shell operator which
> runs a task (or subshell) in the background instead of the foreground. I
> suppose another way would be to use a Makefile with make's -j option.

OK, that's a hint. I'll look that up and surmise a possible adaptation.
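
For my own notes, a minimal sketch of the "&" operator together with the wait
builtin. slow_task is a stand-in for a real lookup, and the one-second sleep is
arbitrary.

```shell
#!/bin/bash
# slow_task is a placeholder for one DNS lookup; it just sleeps and reports.
slow_task() {
  sleep 1
  echo "finished $1"
}

start=$(date +%s)
for n in 1 2 3 4 5 6; do
  slow_task "$n" &      # "&" puts each call in the background
done
wait                    # block until every background job has exited
end=$(date +%s)
echo "elapsed: $((end - start)) seconds"
```

The six tasks overlap, so the elapsed time should be much closer to one second than
to six, and the "finished" lines can come out in any order.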

> A batch of 6, then another batch of 6, would be suboptimal because you'd
> often end up with a single straggler process that has to finish before the
> next batch gets kicked off.

Retaining order is going to take some cleverness. I can't think of any way to do it
other than filling a two-dimensional array, rather than printing the results
sequentially as I do now.
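
One alternative to the array that I can imagine, sketched under assumptions: each
background job writes to its own zero-padded, numbered file, and the files are
concatenated in name order afterwards. This swaps the array for temp files.
fake_lookup and the three sample IPs are placeholders, and no DNS queries are made.

```shell
#!/bin/bash
# fake_lookup stands in for the real per-IP work (six RBL lookups plus the reverse lookup).
fake_lookup() {
  echo "$1 checked"
}

tmpdir=$(mktemp -d)
i=0
for ip in 24.20.246.139 24.197.157.210 41.201.118.192; do
  i=$((i + 1))
  # Zero-padded names make the shell's sorted glob match the input order.
  fake_lookup "$ip" > "$tmpdir/$(printf '%06d' "$i")" &
done
wait                          # all jobs done, in whatever order they finished
results=$(cat "$tmpdir"/*)    # the glob expands in sorted order
rm -rf "$tmpdir"
echo "$results"
```

Each job owns its file, so the jobs never interleave their output, and the final
order depends only on the file names, not on which lookup returned first.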

>
> My solution would be to run an increasing number of processes in the
> background and have each one append to a temp file to let you know it's done
> (which I think gets us atomicity since we don't care about ordering). Before
> each one is started, check to see how many tasks are finished, and if
> (Started - Finished) > MaxProcs, then sleep for a second and re-check:
>
> #!/bin/sh
> MAXPROCS=6
> STARTED=0
> COUNTER=$(mktemp /tmp/counter.XXXXXX)
> # Read list of queries from stdin
> while read QUERY
> do
>   while true
>   do
>     FINISHED=$(wc -l < $COUNTER)
>     RUNNING=$(expr $STARTED - $FINISHED)
>     if [ $RUNNING -lt $MAXPROCS ]; then
>       break
>     fi
>     sleep 1
>   done
>   STARTED=$(expr $STARTED + 1)
>   (
>     $QUERY
>     echo >> $COUNTER
>   ) &
> done
>
> # Wait for all backgrounded tasks to finish before exiting
> while [ $FINISHED -lt $STARTED ]
> do
>   sleep 1
>   FINISHED=$(wc -l < $COUNTER)
> done
>
> # Clean up
> rm -f $COUNTER
>
> The script takes a list of commands to run on stdin, runs a maximum of 6 at
> a time, and doesn't return until they've all finished.
>
> If, for example, all of your queries are HTTP queries using wget, you could
> replace the "$QUERY" line with "wget $QUERY" and pass the script a list of
> URLs on stdin. You could also have the script read from a file by putting
> the main while/do/done loop in a subshell and piping a file to it:
>
> [...]
> cat queryfile | (
> while read QUERY
> do
>   [...]
> done
> )
> [...]
> --
> Robert Woodcock - rcw at blarg.net
> "Duct tape: The last refuge of the incompetent... because the competent
> don't leave it for last."
> 	-- seen on slashdot
>
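
Another route I may try, noted here as a sketch and not as a replacement for your
script: xargs with the -P option, which GNU and BSD xargs provide as an extension,
keeps up to a given number of commands running at once. check_all is a made-up
function name and the echo is a stand-in for the real lookups. Output order is not
guaranteed, so the example sorts it.

```shell
#!/bin/bash
# check_all reads one IP per line on stdin and runs up to 6 workers at a time.
check_all() {
  # -n 1 hands each worker a single IP; "_" fills $0 so the IP arrives as $1.
  xargs -P 6 -n 1 sh -c 'echo "checked $1"' _
}

printf '%s\n' 24.20.246.139 24.197.157.210 41.201.118.192 | check_all | sort
```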


-- 
Paul A. Franz, P.E.
PAF Consulting Engineers
Office 425.440.9505
Cell 425.241.1618

