7

I am using scrapy to fetch some resources, and I want to make it a cron job that starts every 30 minutes.

The cron job:

0,30 * * * * /home/us/jobs/run_scrapy.sh

run_scrapy.sh:

#!/bin/sh
cd ~/spiders/goods
PATH=$PATH:/usr/local/bin
export PATH
pkill -f $(pgrep run_scrapy.sh | grep -v $$)
sleep 2s
scrapy crawl good

As the script shows, I tried to kill any other instance of the script and its child process (scrapy) as well.

However, when I run two instances of the script, the newer instance does not kill the older one.

How can I fix that?


Update:

I have more than one scrapy .sh script, each running at a different frequency configured in cron.


Update 2 - Test for Serg's answer:

All the cron jobs were stopped before I ran the test.

Then I opened three terminal windows, named w1, w2 and w3, and ran the commands in the following order:

1. Run `pgrep scrapy` in w3, which prints nothing (meaning no scrapy is running at the moment).

2. Run `./scrapy_wrapper.sh` in w1.

3. Run `pgrep scrapy` in w3, which prints one process id, say 1234 (meaning scrapy has been started by the script).

4. Run `./scrapy_wrapper.sh` in w2, then check w1 and find that the script there has been terminated.

5. Run `pgrep scrapy` in w3, which prints two process ids, 1234 and 5678.

6. Press <kbd>Ctrl</kbd>+<kbd>C</kbd> in w2 (twice).

7. Run `pgrep scrapy` in w3, which prints one process id, 1234 (meaning the scrapy with id 5678 has been stopped).

At this moment, I have to use `pkill scrapy` to stop the scrapy with id 1234.

hguser
  • the answer, with a good explanation, might be there already: http://stackoverflow.com/questions/9117507/linux-unix-command-to-determine-if-process-is-running and http://stackoverflow.com/questions/2366693/run-cron-job-only-if-it-isnt-already-running – Malkavian Aug 12 '16 at 18:13

9 Answers

9

A better approach would be to use a wrapper script that calls the main script. It would look like this:

#!/bin/bash
# This is /home/user/bin/wrapper.sh file
pkill -f 'main_script.sh'
exec bash ./main_script.sh

Of course, the wrapper has to be named differently from the main script; that way pkill can match only your main script. Your main script then reduces to this:

#!/bin/sh
cd /home/user/spiders/goods
PATH=$PATH:/usr/local/bin
export PATH
scrapy crawl good

Note that in my example I am using ./ because the script was in my current working directory. Use the full path to your script for best results.

I have tested this approach with a simple main script that just runs an infinite while loop, plus the wrapper script: launching a second instance of the wrapper kills the previous one.


Your script

This is just an example. Remember that I have no access to scrapy to actually test this, so adjust it as needed for your situation.

Your cron entry should look like this:

0,30 * * * * /home/us/jobs/scrapy_wrapper.sh

Contents of scrapy_wrapper.sh

#!/bin/bash
pkill -f 'run_scrapy.sh'
exec sh /home/us/jobs/run_scrapy.sh

Contents of run_scrapy.sh

#!/bin/bash
cd /home/user/spiders/goods
PATH=$PATH:/usr/local/bin
export PATH
# sleep delay now is not necessary
# but uncomment if you think it is
# sleep 2
scrapy crawl good
Sergiy Kolodyazhnyy
  • I can not use killall scrapy since there are other scrapy-related cron jobs. See my update – hguser Aug 07 '16 at 09:36
  • @hguser have you tried just pkill -f run_scrapy.sh ? Try exec scrapy crawl good but with pkill -f 'scrapy crawl good' ? Also, is there anything else that you did not tell us ? The fact that there are other processes would have been nice to know from beginning. – Sergiy Kolodyazhnyy Aug 07 '16 at 15:42
  • I have tried pkill -f run_scrapy.sh, but this will kill the currently running script. – hguser Aug 08 '16 at 01:45
  • Alright. I'll dig a little more, will see what i can find. I remember doing something similar before. – Sergiy Kolodyazhnyy Aug 08 '16 at 01:57
  • 1
    @Serg pkill and pgrep have -n and -o options, which allow selecting the newest and the oldest processes. Maybe they can prove useful in your script? – whtyger Aug 08 '16 at 07:35
  • I've updated my answer, please see it – Sergiy Kolodyazhnyy Aug 10 '16 at 07:43
  • Thanks, I had thought of a wrapper method like the one you posted, but I prefer a single script. :) Still, it seems the wrapper approach is the only choice. Fine, I will try that. – hguser Aug 10 '16 at 10:53
  • I tried that. It does terminate the former run_scrapy.sh, but the scrapy process started by the script is not closed. – hguser Aug 10 '16 at 11:00
  • Are you sure it is the same process that was started by the script ? you said you have other jobs related to scrapy , did you test this approach with others running? – Sergiy Kolodyazhnyy Aug 10 '16 at 15:24
  • Yes, I am sure of that; I have posted my test steps in the update of my original post. Please have a check. – hguser Aug 11 '16 at 00:26
  • I saw your edit. Very strange indeed. I think scrapy somehow forks itself out of the parent shell. Can you make a simple script where the only line is scrapy crawl good, and see if that shell script stays open or quits? – Sergiy Kolodyazhnyy Aug 11 '16 at 01:18
  • I created a single script named t.sh with the one line scrapy crawl good and ran ./t.sh; it does not quit, it shows the logs from scrapy, and pgrep scrapy can print the pid of scrapy. Once I press Ctrl+C to stop t.sh, scrapy is stopped too. – hguser Aug 11 '16 at 05:37
4

If I understand what you are doing correctly, you want to call a process every 30 minutes (via cron). However, when you start a new process via cron, you want to kill any existing version that is still running?

You could use the "timeout" command to ensure that scrapy is forced to terminate if it is still running after 30 minutes.

This would make your script look like this:

#!/bin/sh
cd ~/spiders/goods
PATH=$PATH:/usr/local/bin
export PATH
timeout 30m scrapy crawl good

Note the timeout added in the last line.

I have set the duration to "30m" (30 minutes). You might want to choose a slightly shorter time (say 29m) to ensure that the process has terminated before the next job starts.
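
If scrapy does not exit cleanly on the SIGTERM that timeout sends by default, one possible variant (a sketch assuming GNU coreutils timeout, with 29m as suggested above and an arbitrary 30-second grace period) adds --kill-after so a SIGKILL follows shortly afterwards; only the last line of the script changes:

# Send SIGTERM after 29 minutes; if scrapy is still alive 30 seconds later,
# follow up with SIGKILL (GNU coreutils timeout).
timeout --kill-after=30s 29m scrapy crawl good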

Note that if you change the spawn interval in crontab, you will have to edit the script as well.

Nick Sillito
  • Good one! Wish I could upvote this a few times. Note though, that cd ~/some/dir probably isn't a good idea, because cron jobs run as root, right? So ~ will refer to root's home, not the user's directory – Sergiy Kolodyazhnyy Aug 12 '16 at 20:18
2

Great. A little update that allows the script to determine its own filename without hardcoding it:

#!/bin/bash
# runchecker.sh
# This script obtains its own filename and then
# checks whether the script is already running.
# If it is, the script exits.

filename=$(basename "$0")
echo "running now $filename"

pids=($(pidof -x "$filename"))

if [ ${#pids[@]} -gt 1 ] ; then
    echo "Script already running by pid ${pids[1]}"
    exit
fi

echo "Starting service"
sleep 1000


RFV-370
1

As pkill terminates only the specified process, we should terminate its child processes using the -P option. So the modified script will look like this:

#!/bin/sh

cd /home/USERNAME/spiders/goods
PATH=$PATH:/usr/local/bin
export PATH
PID=$(pgrep -o run_scrapy.sh)
if [ $$ -ne $PID ] ; then pkill -P $PID ; sleep 2s ; fi
scrapy crawl good

(An earlier version of this script used trap: trap runs the defined command (in double quotes) on the EXIT event, i.e. when run_scrapy.sh is terminated. There are other events; you'll find them in help trap.)
pgrep -o finds the oldest instance of the process with the defined name.

P.S. Your idea with grep -v $$ is good, but it won't return the PID of the other instance of run_scrapy.sh, because $$ will be the PID of the subprocess $(pgrep run_scrapy.sh | grep -v $$), not the PID of the run_scrapy.sh that started it. That's why I used another approach.
P.P.S. You'll find some other methods of terminating subprocesses in Bash here.

whtyger
  • It does not work. It does terminate the former run_scrapy.sh, but it cannot terminate the child process (scrapy) started by the script. – hguser Aug 10 '16 at 03:03
  • @hguser Does pkill -f 'scrapy crawl good' (started from other console) terminate running scrapy process? If no, then pkill scrapy should work (it will terminate all instances of scrapy). Replace trap line with trap "pkill scrapy" EXIT and that's it. – whtyger Aug 10 '16 at 09:19
  • pkill scrapy can stop the scrapy process, however more than one scrapy-related job may be running at the same time, and I cannot kill the other scrapy jobs – hguser Aug 11 '16 at 09:39
  • @hguser Not sure that I understood you correctly. The last sentence at least. So you need that new scrapy crawl good terminate only the previously started scrapy crawl good process, and some other scrapy process starter script would terminate only the same scrapy process started previously? Does pkill 'scrapy crawl good' terminates the desired scrapy process when run simply from the command line? – whtyger Aug 11 '16 at 10:08
  • Your understanding is right. pkill 'scrapy crawl good' does not stop the scrapy process :( – hguser Aug 11 '16 at 11:16
  • @hguser Yep, -f is critical, otherwise pkill doesn't terminate scrapy. Forgot to ask: did you replace USERNAME in the script with valid name of user? – whtyger Aug 11 '16 at 11:36
  • @hguser Forget about trap. I updated my script. Now when I run it in 2 terminal windows side by side, I see that the latter process produces SIGTERM in the former one and shutdowns the spider in it. Now it SHOULD work. – whtyger Aug 11 '16 at 12:06
  • No, the scrapy is not closed. – hguser Aug 11 '16 at 14:14
  • @hguser You run both scripts from cron, one by one? Or you run first from cron, and then second using console? Have you tried running both copies of this script from console? – whtyger Aug 11 '16 at 14:43
  • I run the script in two console windows. The latter does not kill the former. And I tried pkill -9 -P ...; I can see "killed" printed in the former console, but it is still running – hguser Aug 12 '16 at 00:51
  • @hguser pkill produces -15 signal by default, it causes graceful shutdown of scrapy. I noticed that even after SIGTERM I see messages Crawled while closing spider and GET messages. But when it finishes at last, I see (shutdown) state instead of (finished) in case of normal exit. Maybe multiple opened TCP sessions aren't closed instantly, that's why the spider remains among the processes. Can you increase the delay to, say, 30s or even more to get the final result of the former script? Besides that I'm out of ideas - it works here, dunno why it doesn't work as intended on your side. – whtyger Aug 12 '16 at 07:01
  • Now, I use this script PID=$$;export PID; pgrep -f goods.sh | xargs -I {} sh -c 'if [ "{}" != "$PID" ] ; then echo -{} ; fi' | xargs echo | xargs kill -9 -- It works :) – hguser Aug 12 '16 at 07:36
  • Eh, it works when run from the terminal, but once in cron it will kill itself too. :( – hguser Aug 12 '16 at 08:42
1

Maybe you should monitor whether the script is running by writing the parent shell script's PID to a file, and kill any previously running parent shell script by checking that PID file. Something like this:

#!/bin/sh
PATH=$PATH:/usr/local/bin
PIDFILE=/var/run/scrappy.pid
TIMEOUT="10s"

#Check if script pid file exists and kill process
if [ -f "$PIDFILE" ]
then
  PID=$(cat $PIDFILE)
  #Check if process id is valid
  ps -p $PID >/dev/null 2>&1
  if [ "$?" -eq "0" ]
  then
    #If it is valid kill process id
    kill "$PID"
    #Wait for timeout
    sleep "$TIMEOUT"
    #Check if process is still running after timeout
    ps -p $PID >/dev/null 2>&1
    if [ "$?" -eq "0" ]
    then
      echo "ERROR: Process is still running"
      exit 1
    fi
  fi 
fi

#Create PID file
echo $$ > $PIDFILE
if [ "$?" -ne "0" ]
then
  echo "ERROR: Could not create PID file"
  exit 1
fi

export PATH
cd ~/spiders/goods
scrapy crawl good
#Delete PID file
rm "$PIDFILE"
iuuuuan
0

Too simple:

#!/bin/bash

# Collect the PIDs of all running instances of this script (here named sample.sh)
pids=($(pidof -x sample.sh))

# If more than one PID is found, another instance is already running
if [ ${#pids[@]} -gt 1 ] ; then
    echo "Script already running by pid ${pids[1]}"
    exit
fi

echo "Starting service"
sleep 1000
mah454
0

It can be very tricky to correctly identify exactly the process(es) belonging to another invocation of the command you're about to run based on a listing of all current processes.

Therefore, a well-established solution to this problem in the Unix world is to use a so-called sentinel file, usually containing nothing but the process id (PID) of the process creating the file (and called a pidfile for that reason).

Prior to invoking the command, you try to create the file with exclusive write access. If this fails, you bail out. If not, you run the command, and after completion, you remove the file.

Now if the command is killed with -KILL, or if the host loses power, you may end up with a lockfile for which the process has died. So at some point you should clean up lockfiles for which no corresponding process is running. This is why the process ID is written to the lockfile.
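
For illustration, a minimal sketch of this pidfile approach (the lock path and the crawl command are placeholders, not part of the answer):

#!/bin/sh
# Pidfile with exclusive creation (noclobber) and stale-lock cleanup.
LOCKFILE=/tmp/run_scrapy.pid

# Remove a stale lockfile whose recorded process no longer exists.
if [ -f "$LOCKFILE" ] && ! kill -0 "$(cat "$LOCKFILE")" 2>/dev/null; then
    rm -f "$LOCKFILE"
fi

# With noclobber (set -C) the redirection fails if the file already exists,
# so only one instance wins the race to create it.
if ( set -C; echo "$$" > "$LOCKFILE" ) 2>/dev/null; then
    trap 'rm -f "$LOCKFILE"' EXIT
    scrapy crawl good
else
    echo "Another instance holds $LOCKFILE, exiting." >&2
    exit 1
fi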

Using file locking (which wasn't always available in early Unix, but it is in Linux today), you don't need to use the process ID: you can attempt to lock the file, creating it if it doesn't exist. If the process that created the file dies, the file will still exist, but the lock on it will have gone.

Linux now has a standard utility to do this for you: flock (see its manpage). You can wrap it around arbitrary commands.
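
As a minimal example (the lock file path is an assumption), the cron entry from the question could be wrapped in flock like this; -n tells flock to give up immediately rather than wait if the lock is already held:

0,30 * * * * /usr/bin/flock -n /tmp/run_scrapy.lock /home/us/jobs/run_scrapy.sh

Note that this skips the new run while an old one is still going, rather than killing the old one; if you want the killing behaviour instead, combine it with one of the pkill-based approaches above.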

0

Well, I had a similar problem in C using popen() and wanted to kill the parent and all of its children after a timeout. The trick is to set a process group ID when starting the parent, so that you don't kill yourself. How to do this is described here: https://stackoverflow.com/questions/6549663/how-to-set-process-group-of-a-shell-script. With "ps -eo pid,ppid,cmd,etime" you can filter by running time. With both pieces of information you should be able to find all old processes and kill them.
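
In shell terms, a minimal sketch of the same idea (the PGID file path is a placeholder): start the job in its own process group with setsid, remember the group leader's PID, and later signal the whole group at once:

# Start the job in a new session (and therefore a new process group).
# In a non-interactive script the backgrounded child is not already a
# group leader, so setsid reuses its PID, which then equals the PGID.
setsid /home/us/jobs/run_scrapy.sh &
echo "$!" > /tmp/run_scrapy.pgid

# Later, kill the old job together with all of its children:
kill -- -"$(cat /tmp/run_scrapy.pgid)"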

0x0C4
0

You could check an environment variable to track the status of the script and set it appropriately at script start, something like this pseudocode:

if "$SSS" = "Idle"
then 
    set $SSS=Running"
    your script
    set $SSS="Idle"

You can also track status by creating/checking/deleting a marker file, e.g. touch /pathname/myscript.is.running after an existence check at launch, and rm /pathname/myscript.is.running at the end.
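
A minimal sketch of that marker-file idea (the path and the crawl command are placeholders):

#!/bin/sh
# Marker-file guard: skip this run if a previous one is still in progress.
MARKER=/pathname/myscript.is.running

if [ -e "$MARKER" ]; then
    echo "Previous run still in progress, exiting." >&2
    exit 1
fi

touch "$MARKER"
scrapy crawl good
rm -f "$MARKER"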

This approach will allow you to use different identifiers for your different scrapy scripts to avoid killing the wrong ones.

Regardless of how you track the status of your script and whether you deal with the problem by prevention of launch or killing the running process, I believe that using a wrapper script as suggested by @JacobVlijm & @Serg will make your life much easier.

Elder Geek