I have written a small Python scraper (using the Scrapy framework). The scraper requires a headless browser; I am using ChromeDriver.

Since I am running this code on an Ubuntu server which does not have a GUI, I had to install Xvfb in order to run ChromeDriver on the server (I followed this guide).

This is my code:

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self):
        # self.driver = webdriver.Chrome(ChromeDriverManager().install())
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        self.driver = webdriver.Chrome('/usr/bin/chromedriver', chrome_options=chrome_options)

I can run the above code from the Ubuntu shell and it executes without any errors:

ubuntu@ip-1-2-3-4:~/scrapers/my_scraper$ scrapy crawl my_spider

Now I want to set up a cron job to run the above command every day:

# m h  dom mon dow   command
PATH=/usr/local/bin:/home/ubuntu/.local/bin/
05 12 * * * cd /home/ubuntu/scrapers/my_scraper && scrapy crawl my_spider >> /tmp/scraper.log 2>&1

but the cron job gives me the following error:

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 192, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 196, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/home/ubuntu/.local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 86, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 98, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/scrapy/spiders/__init__.py", line 19, in from_crawler
    spider = cls(*args, **kwargs)
  File "/home/ubuntu/scrapers/my_scraper/my_scraper/spiders/spider.py", line 27, in __init__
    self.driver = webdriver.Chrome('/usr/bin/chromedriver', chrome_options=chrome_options)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
  (Driver info: chromedriver=2.41.578700 (2f1ed5f9343c13f73144538f15c00b370eda6706),platform=Linux 5.4.0-1029-aws x86_64)

Update

This answer helped me solve the issue (though I don't quite understand why).

I ran echo $PATH in my Ubuntu shell and copied the value into the crontab:

PATH=/home/ubuntu/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
05 12 * * * cd /home/ubuntu/scrapers/my_scraper && scrapy crawl my_spider >> /tmp/scraper.log 2>&1

Note: As I have created a bounty for this question, I am happy to award it to any answer that explains why changing the PATH solved the issue.

3 Answers

This is the reason for almost all cases where cron doesn't seem to run.

Cron always runs with a mostly empty environment: HOME, LOGNAME, and SHELL are set, along with a very limited PATH. It is therefore advisable to use complete paths to executables, and to export any variables you need in your script when using cron.
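
A quick way to see exactly what environment your cron jobs actually get is to dump it to a file and compare it with the output of env in your interactive shell (the same trick is mentioned in the comments below; the output path is arbitrary):

* * * * * /usr/bin/env > /tmp/env_on_crontab 2>&1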

Also note that you can't use variable substitution as you would in a shell script, so a declaration like PATH=/usr/local/bin:$PATH is interpreted literally.
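
For example, this crontab line sets PATH to the literal string /usr/local/bin:$PATH rather than prepending to the existing value:

PATH=/usr/local/bin:$PATH

Write out the full expanded value instead (e.g. the output of echo $PATH in your shell), as the asker did in the update above.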

– Pablo Bianchi
  • Thanks for this Pablo, just one question: when I set the PATH in a cron job, is the PATH modified within the scope of that cron job only, or does it change the $PATH variable that is used everywhere else on Ubuntu? – Hooman Bahreini Nov 02 '20 at 21:44
  • Neither. If you set the variable at the beginning of the crontab, it will be there only for the cron lines that follow. You can also set a variable for a single line only, though. You might also find it useful to export, and to set * * * * * /usr/bin/env > /tmp/env_on_crontab. – Pablo Bianchi Nov 03 '20 at 01:58
  • Ah, thanks... I actually have 2 lines in my crontab... the job is executed twice a day, morning and evening... so do I need to set the path two times in the crontab? i.e. once before each line? – Hooman Bahreini Nov 03 '20 at 02:02
  • Only one is necessary. Also, you probably will need only one cron line. Check with crontab.guru – Pablo Bianchi Nov 05 '20 at 17:40

The commands readlink, dirname and cat could not be located because /bin is not included in the PATH environment variable.

Explanation

unknown error: Chrome failed to start: exited abnormally The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.

If you set PATH=/usr/local/bin:/home/ubuntu/.local/bin/ and then execute /usr/bin/google-chrome --no-sandbox --headless --disable-dev-shm-usage, you'll get:

/usr/bin/google-chrome: line 8: readlink: command not found
/usr/bin/google-chrome: line 10: dirname: command not found
/usr/bin/google-chrome: line 45: exec: cat: not found
/usr/bin/google-chrome: line 46: exec: cat: not found
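
/usr/bin/google-chrome is a shell-script wrapper (which is why the errors above cite line numbers inside it), and readlink, dirname, and cat are the utilities it calls. As a quick sanity check, you can see where those utilities live on your system; they resolve under /bin or /usr/bin, neither of which was in the PATH above:

which readlink dirname cat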

You can also try this: su - ubuntu runs the command in a fresh login shell for user ubuntu, which loads that user's full environment, including PATH. (Note that this form generally belongs in root's crontab, since su prompts for a password when invoked by a non-root user.)

05 12 * * *   su - ubuntu -c 'cd /home/ubuntu/scrapers/my_scraper && scrapy crawl my_spider >> /tmp/scraper.log 2>&1'
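
A similar sketch that works from the ubuntu user's own crontab (assuming bash is the login shell): bash -l starts a login shell, which sources the user's profile and therefore picks up the full PATH:

05 12 * * * /bin/bash -lc 'cd /home/ubuntu/scrapers/my_scraper && scrapy crawl my_spider >> /tmp/scraper.log 2>&1'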