Needing to do a very bulk Bulk Update

JuliaP's picture

We need to run the Pathauto bulk update feature for approximately 100,000 articles. Even after increasing the maximum allowed per bulk update, this is going to take ages. Any ideas? Thanks in anticipation.

Comments

miro_dietiker's picture

I hope your site runs on a dedicated server! If so, you can do this efficiently.

A:
Keep the bulk update value at about 1000 and trigger it repeatedly, each time right after the previous run has finished. 100 calls shouldn't take that long.
This way you can still check how the server behaves during each run. I'd expect a processing time of 30 to 60 seconds per run. You can also stop triggering at any point if the server is not as healthy as it should be.
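
A rough sketch of the "stop if the server is not healthy" check from approach A, in shell. It assumes a Linux server (it reads /proc/loadavg). The function name load_below and the threshold are made up for illustration; nothing here comes from Pathauto itself.

```shell
#!/bin/sh
# Succeeds when the 1-minute load average is below the given threshold.
# The optional second argument overrides the measured load, which is
# handy for trying the function out; by default the value is read from
# /proc/loadavg (Linux only, an assumption of this sketch).
load_below() {
  threshold=$1
  load=${2:-$(cut -d ' ' -f 1 /proc/loadavg)}
  # awk does the floating-point comparison; exit status 0 means "below".
  awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l < t) }'
}
```

Calling something like `load_below 4 || exit 1` before each trigger would stop the run once the 1-minute load reaches 4. Pick a threshold that fits your own server.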

B:
Increase PHP's maximum execution time (e.g. in php.ini), increase the bulk update amount to 100,000 and call the trigger.
Don't forget to allow PHP to consume a large amount of memory. (I don't know the amount needed. I'm pretty sure it would work with normal memory settings, but you should avoid script kills due to such side conditions when doing hacks like this.)
I'd recommend taking the site offline during this process, since the server will be under heavy load.
Back up the DB first and be ready to kill apache/httpd if something goes wrong.
You should calculate the expected total time (by measuring the cron runtime with e.g. 100 items) and have a lot of patience during the single call.
There's no progress bar and nothing you can do except wait. I wouldn't like that, because there's no tool to interrupt the update cleanly. Finally, undo your changes to php.ini and the bulk update amount.

Processing a challenge like this in chunks is always the best idea, and processing a lot of data always takes time.

Have fun .-)

Otherwise

earnie@drupal.org's picture

Otherwise it will take ages, as you've already noted. If you have 100,000 nodes and process 100 every 10 minutes, you will have updated 600 in one hour and 14,400 in 24 hours. That will take just shy of 7 days. If you increase the 100 to 1000, it will take less than a day to process all of them.
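
To check the arithmetic for other batch sizes or cron intervals, here is a small shell sketch. The function name hours_needed is made up for illustration. It only assumes that every cron run processes a full batch and that runs are spaced evenly.

```shell
#!/bin/sh
# Estimate how long a cron-driven bulk update takes.
# Arguments: total items, items per cron run, minutes between cron runs.
# Prints the number of hours, rounded up to a whole hour.
hours_needed() {
  total=$1
  per_run=$2
  interval_min=$3
  # Number of cron runs needed, rounded up.
  runs=$(( (total + per_run - 1) / per_run ))
  # Total minutes, then hours rounded up.
  minutes=$(( runs * interval_min ))
  echo $(( (minutes + 59) / 60 ))
}

hours_needed 100000 100 10    # 1000 runs * 10 min = 10000 min, prints 167
hours_needed 100000 1000 10   # 100 runs * 10 min = 1000 min, prints 17
```

167 hours is just under 7 days, and 17 hours is less than a day, which matches the figures above.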

markus_petrux's picture

If you create a subfolder under your Drupal installation, something like /scripts, then provided that your Apache configuration allows you to override PHP settings per subdirectory, you could create a .htaccess file similar to this:

php_value memory_limit 64M
php_value max_execution_time 3600
php_value mysql.connect_timeout 3600

In that folder you could create a supercron.php script like this:

<?php
/**
* This supercron.php script is just a wrapper to Drupal's cron.php
* that can run with different PHP settings.
*/

// Move execution context to Drupal root.
chdir('..');

// Now, run normal Drupal's cron.php
require './cron.php';

Finally, adjust your crontab to run this script instead of Drupal's cron.php.

# This is normal Drupal's cron invocation.
#0 * * * * wget -O - -q -t 1 http://www.example.com/cron.php
# This is our new cron script with potentially more PHP resources available.
0 * * * * wget -O - -q -t 1 http://www.example.com/scripts/supercron.php

I haven't tested this, but it might be an option that saves you from raising PHP settings for the whole site. The higher limits would apply only to Drupal cron executions.

all great ideas

greggles's picture

Lots of great feedback here - thanks everyone.

There is also a handbook page, "Bulk updating Pathauto node aliases from cron or command line". That could be updated with some of the ideas in this thread and could also help inform your decision, JuliaP.


Thanks for the suggestions!

JuliaP's picture

Thanks for the suggestions!

cron it

damienmckenna's picture

Idea 1:

  • Set the "batch update count" as high as your server will accept. I couldn't set mine past 200; otherwise it would fail and I couldn't be sure how many had been done.
  • I created a copy of cron.php that allows a specific module's cron function to be executed.
  • I then set up a bash script to run it, oh, say, 100 times. 100*200 = 20,000. Or run it 1000 times. You get the picture.
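
A minimal sketch of what such a loop script could look like. This is not the script I actually used. The URL is a placeholder, the function names trigger_cron and run_batches are made up, and the wget flags are simply the ones from the crontab example earlier in this thread. It relies on wget returning a non-zero exit status when the request fails; a PHP error that still returns a normal page would not be noticed.

```shell
#!/bin/sh
# Hypothetical URL; replace it with the cron script of your own site.
CRON_URL="${CRON_URL:-http://www.example.com/cron.php}"

# Fetch the cron URL once and discard the page output.
trigger_cron() {
  wget -O - -q -t 1 "$CRON_URL" > /dev/null
}

# Run the trigger up to $1 times, one run after the other.
# Stops at the first failed run and prints how many runs completed.
run_batches() {
  max_runs=$1
  done_runs=0
  while [ "$done_runs" -lt "$max_runs" ]; do
    if ! trigger_cron; then
      echo "stopped after $done_runs runs" >&2
      break
    fi
    done_runs=$(( done_runs + 1 ))
  done
  echo "$done_runs"
}
```

Calling `run_batches 100` at the end of the script would then fire up to 100 runs back to back. Because each run only starts after the previous one has returned, the server never handles two bulk updates at once.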

Idea 2:

Write a module that does the same and uses a JavaScript page refresh call to keep cycling for as long as you want.

a different approach.

mikejoconnor's picture

I take a different approach. One of my clients has over 1,000,000 url_aliases. To generate all of the aliases, I enabled php-cli on the server and wrote a simple script to bootstrap Drupal and execute the bulk update.

In the long run, I think a better solution is to add an additional database table to keep track of which items have url_aliases. It could simply contain the url_alias key, the URL type (node, term, etc.), and the type_id (nid, tid, etc.). I believe apachesolr does (or did) something similar to track which nodes have been indexed.
