I am working on a new Drupal site and it appears to be pretty well tuned for traffic. There are currently 450,000 nodes that are updated on a weekly basis. I am running into some issues with drupal_execute. The script I have created queries a MySQL database for all the node information and loops through it, using drupal_execute to either create a node if one doesn't exist or update it if it finds a match based on a specific CCK ID. I can get through about 40,000 nodes before my server slows to a crawl and starts swapping.
Does anyone have suggestions on making drupal_execute run more efficiently? Is there a way I could off-load this task or bootstrap on another machine?
I currently have a dedicated server with 1.5 GB of RAM, running Drupal 5, MySQL 5, Apache 2. The other alternative I have ATM is putting a box up in house with better specs on a T1 connection but would prefer the bandwidth on the dedicated server.
Secondly, since I know when the site is updated, I would like to index all the new information at once versus a few hundred nodes every time cron runs. I am noticing some slow queries coming from the search index, which currently has 8,000,000 rows and is growing. Are there any suggestions for Drupal's search module?
Thanks,
Ryan

Comments
Break up the run
Running 40k+ drupal_executes as fast as possible is a great way to flood your site. Just think of what the equivalent "real world" scenario would be: 1000s of users adding nodes at a furious pace all at once. Supporting that would need a cluster at least.
Given what you've said I think the best solution would be to batch your update process so that it doesn't overheat. Maybe check into the job_queue and drupal.sh systems.
http://www.chapterthree.com | http://www.outlandishjosh.com
40,000 is a lot ... but ...
@joshk has a point, but if you are doing drupal_execute() in sequence (not in parallel), then it should not "overheat" that easily.
The fact that it can get to 40,000 is in itself admirable, but you have not said what eats up memory before you go into thrashing. Is it MySQL? Can it cope with this kind of rapid fire? Is it the script run from cron? Do you see it eat more and more memory until you swap? Or is the server busy doing other stuff (normal traffic even) that eventually kills everything?
Find out what eats up the memory and there may be a solution for it.
I agree with Josh, though, that job_queue can help, but it depends on your needs and design. Also the batch API in Drupal could help if you were on D6 (you are not).
For search, look into Xapian and Sphinx Search. See if they can overcome the issue you are facing.
Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.
Good point
That's a good point about parallel vs. serial, Khalid. I suppose if the script is running them in serial order, it could also be just memory leakage getting up to the point where that one PHP thread is enough to take the system into swap. But yes, the real question is figuring out where the thrashing starts.
http://www.chapterthree.com | http://www.outlandishjosh.com
Yeah
Yeah
That was my thinking too.
However, I would be surprised to see that PHP can go on and on and exhaust 1GB. I don't think anyone would set memory_limit that high.
Could be MySQL that is leaking, or something else.
So, it is best to have actual data on what consumes that memory before the swapping.
Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.
Hello Guys, Thanks for the
Hello Guys,
Thanks for the quick response and suggestions. I have restarted the server and am currently monitoring the script's usage. So far it has gotten through about 25,000 drupal_execute calls. The PHP process's memory is increasing fairly quickly and is at 25% of system memory according to top. ATM I have just been running the PHP script from the shell. Now that I think about it, doesn't that bypass the memory restrictions? I have the script set to 256MB of allowed memory.
So it looks like there is a memory leak in the php code. Is that safe to say?
MySQL and Apache aren't breaking a sweat memory wise. MySQL starts out around 1k queries per sec.
Seems like a leak yes
Seems like a leak in PHP. PHP from the command line should still obey the memory_limit directive. The fact that it does not stop at 256M, and that you are not going through Apache, points to the problem being within PHP.
But there are some things you can try. For example, do unset() on any variables you are using. Avoid repeatedly running code that allocates resources (e.g. db connections, etc.). Hopefully this can overcome it.
The other option is the wonderful world of workarounds: you give the script a number of items to process, then it quits. You can put checkpoints every 1000 or so, write the position to a temp file, then use that on the next iteration. The old script can even spawn itself, which will be a new instance of PHP, hence avoiding the root issue.
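The checkpoint idea could be sketched roughly like this. It is only an illustration: import_run_batch(), the checkpoint file and the $process_row callback are made-up names, not Drupal API, and a real import would page through a query instead of holding every row in an array. The callback is where drupal_execute() would be called.

```php
<?php
/**
 * Process up to $limit rows, resuming from the offset stored in
 * $checkpoint_file. Hypothetical helper, not part of Drupal.
 *
 * Returns TRUE when every row has been processed, FALSE if more remain.
 */
function import_run_batch(array $rows, $process_row, $checkpoint_file, $limit = 1000) {
  // Resume from the last checkpoint, or start at row 0.
  $offset = is_file($checkpoint_file) ? (int) file_get_contents($checkpoint_file) : 0;
  $end = min($offset + $limit, count($rows));

  for ($i = $offset; $i < $end; $i++) {
    // In a real import this callback would wrap drupal_execute().
    call_user_func($process_row, $rows[$i]);
  }

  if ($end >= count($rows)) {
    // Finished: remove the checkpoint so the next full run starts fresh.
    if (is_file($checkpoint_file)) {
      unlink($checkpoint_file);
    }
    return TRUE;
  }

  // More work remains: record where the next PHP process should resume.
  file_put_contents($checkpoint_file, (string) $end);
  return FALSE;
}
```

Each invocation of the script would call this once and exit, so whatever memory the batch accumulated is released with the process. Something outside PHP (cron, a shell loop) re-runs the script until it reports completion.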
Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.
"Memory Leak" Is A Strong Term
Heh, well as the title says, "Memory Leak" is a strong term :)
Really, Drupal (as are all web applications and most php scripts) is designed to run in cycles based off url requests, so PHP can do cleanup and flush memory after the page is built. This is why PHP has built in memory limit and execution time limits. What you're simulating here is the page request that runs and runs and runs until resources are exhausted.
Batching your requests is the answer. Have the script run through 1,000 at a time, and at the end of each cycle PHP will do its garbage collection and you can begin on the next batch w/full system resources.
OTOH, there may be some hardcore core engineers who would be interested in seeing how this use-case exposes some areas for optimization, but my sense is that this is par for the course if you utilize a web-app designed to run in (relatively) short cycles for a process that may go on for quite a while.
http://www.chapterthree.com | http://www.outlandishjosh.com
PHP issue ...
If PHP is set to use only 256MB, then it should abort upon it using more than that. Fact is, it does not.
Therefore, it is not a Drupal problem per se, and is a PHP issue (leak).
Yes, Drupal could perhaps clean after itself better when doing drupal_execute() and such but where do we draw the line? It could be in node_save(), it could be a static variable somewhere being added to, it could be in any other place.
As you said this use case is interesting, and perhaps some unset()s are needed in the right places, but where?
Still, even if we did that, PHP is not aware that it is leaking something.
So, two areas here: Drupal and PHP.
Workarounds to the rescue ...
Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.
node_load() static data?
<?php
function node_load($param = array(), $revision = NULL, $reset = NULL) {
  static $nodes = array();

  if ($reset) {
    $nodes = array();
  }
  // ... (rest of node_load() omitted)
}
?>
If node_load() is executed in the process, then it may be eating a lot of memory depending on the size of the nodes and the number of node_load() executions.
If you cannot control how node_load() is invoked, then maybe in the main loop of the process you can simply introduce a call such as node_load(FALSE, NULL, TRUE) (this works OK in D6), which returns FALSE and clears the static data.
For D5, the call to clear the static data would have to look something like node_load(0, NULL, TRUE), where 0 is a non-existent nid.
Workaround
So what I ended up doing was creating a bash script which checks for an iterator file. If the file exists, it keeps re-running my PHP script until all nodes have been updated or created. It has taken 48 hours to get through 420,000 nodes. I will work on optimizing the script a little bit and on the suggested node_load cache issues.
node_load is called every time a match is found. Usually there are only a few thousand new nodes, so that means node_load is being called around 400,000+ times.
There!
I suspected a static variable in the thread above, and markux_petrux confirmed it and provided a solution.
Just go back to your earlier script, add a reset call to node_load() every 100 nodes or so, and rerun it.
See if that solves the problem before you do any workarounds.
Also, if the process still exceeds PHP's memory_limit value, then it would be a PHP bug (memory leak) that should be reported.
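To show why a periodic reset bounds this kind of growth, here is a toy stand-in for a loader with a static cache, modelled on the node_load() excerpt above. demo_load() and demo_run() are invented for the illustration and are not Drupal code. In a real script the reset line would be the node_load(..., NULL, TRUE) call discussed above.

```php
<?php
/**
 * Toy loader with a static cache and a reset flag (not Drupal code).
 * Called with no $id, it returns the current number of cached items.
 */
function demo_load($id = NULL, $reset = FALSE) {
  static $cache = array();
  if ($reset) {
    $cache = array();
  }
  if ($id === NULL) {
    return count($cache);
  }
  if (!isset($cache[$id])) {
    // Simulate a loaded object that takes up memory.
    $cache[$id] = str_repeat('x', 1024);
  }
  return $cache[$id];
}

/**
 * Load $total items, clearing the static cache every $every items.
 * Returns the largest cache size observed during the run.
 */
function demo_run($total, $every) {
  $peak = 0;
  for ($i = 1; $i <= $total; $i++) {
    demo_load($i);
    $peak = max($peak, demo_load());
    if ($i % $every == 0) {
      // Same role as passing $reset = TRUE to node_load().
      demo_load(NULL, TRUE);
    }
  }
  return $peak;
}
```

With a reset every 100 items the cache never holds more than 100 entries, however long the loop runs. Note that this only bounds the one cache being reset; any other static cache touched during the load is unaffected.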
Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.
Hi, I think it's not the
Hi,
I think the problem is not node_load() but PHP itself. An object's destructor is only called once all references to it, including those held in member variables, have been freed.
Aries
Bash workaround
I did drop the memory limit down to 128MB and was able to get the script to error out. I have tried unsetting all variables which I had created and emptying all arrays, with no luck. I think the workaround I am going to move forward with is creating a bash script which can be run weekly. This will check an iterator file and keep re-running my PHP script until it has completed all rows. This way at least the PHP process ends and a new one is created every x number of rows. Hopefully that will make it through without choking PHP.
What do you guys think?
How would I find out if the search module indexes nodes on creation or updates?
try markus suggestion first
See just above in http://groups.drupal.org/node/17211#comment-58794
The node_load cache may be your problem. It was a source of problems for Pathauto bulkupdates until we disabled the caching:
$node = node_load($node_ref->nid, NULL, TRUE);
Even if node_load is not the specific problem... some other caching internal to Drupal might be.
--
Growing Venture Solutions | Drupal Dashboard | Learn more about Drupal - buy a Drupal Book
knaddison blog | Morris Animal Foundation
Even reset of node_load does not work ...
I used this module, and ran cron, but it still blew up the memory. I had it configured for 64MB, and attempts to reset after 50 nodes.
However, as you can see, memory keeps climbing even if I try to reset the static cache.
This is Drupal 6.x-dev, and PHP 5.2.4 with APC.
<?php
function custom_cron() {
  $result = db_query("SELECT nid FROM {node}");

  // Initialize counters
  $processed = 0;
  $total = 0;

  // Loop through the nodes
  while ($data = db_fetch_object($result)) {
    $total++;
    $processed++;

    $x = NULL;
    $x = node_load($data->nid);

    // Reset the static cache every once in a while, so we don't blow up memory
    if ($processed >= 50) {
      $mem = number_format(memory_get_usage()/1024/1024, 1);
      print "Resetting. Total=$total mem=$mem<br>\n";
      $y = NULL;
      $y = node_load(array('nid' => 1), NULL, TRUE);
      $processed = 0;
    }
  }
}
?>
Here is the output :
Resetting. Total=50 mem=6.5
Resetting. Total=100 mem=6.7
Resetting. Total=150 mem=7.5
Resetting. Total=200 mem=8.3
Resetting. Total=250 mem=9.1
Resetting. Total=300 mem=9.8
Resetting. Total=350 mem=10.5
Resetting. Total=400 mem=11.3
Resetting. Total=450 mem=12.3
...
Resetting. Total=3850 mem=62.1
Resetting. Total=3900 mem=62.9
Resetting. Total=3950 mem=63.7
Then it aborts with Fatal error: Allowed memory size of 67108864 bytes exhausted (tried to allocate 76 bytes) in ../includes/database.mysqli.inc on line 144
Using node_load(FALSE, NULL, TRUE) did not make a difference.
Does $nodes = array() not reset memory?
Berchem, we have a problem ...
Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.
I just ran your test script,
I just ran your test script, but I dropped {hook}_node_load and nodeapi('load') out of it and it worked as expected (mem stayed at 2.6 all the way through), so I would venture to say something is being called in one of those hooks that is also caching data. The most obvious one would be in taxonomy_node_get_terms().
HollyIT - Grab the Netbeans Drupal Development Tool at GitHub.
Unclear
Are you saying that you kept node_load() in the unmodified script but modified node.module (or friends) to not call the hooks?
Yes, I agree that taxonomy_node_get_terms() caches terms but on a vocabulary basis. This site has 11 vocabularies, and 1000 terms only. Is that enough to consume that much memory? How many do you have?
To rule this out, just disable the taxonomy module and rerun the test, which will show beyond doubt whether it is the taxonomy module.
Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.
I just commented out these
I just commented out these parts in node.module:
if ($extra = node_invoke($node, 'load')) {
  foreach ($extra as $key => $value) {
    $node->$key = $value;
  }
}
if ($extra = node_invoke_nodeapi($node, 'load')) {
  foreach ($extra as $key => $value) {
    $node->$key = $value;
  }
}
I was only using Taxonomy as an example, since I know off hand that does do a static caching. To really know what is causing the memory usage to increase you would have to go through and look at all the nodeapi("load") functions and see if anything in those is doing some static caching.
HollyIT - Grab the Netbeans Drupal Development Tool at GitHub.
It is taxonomy
I can confirm it is taxonomy now.
I did not modify any code, but disabled the taxonomy module, and was able to do all the 5000 nodes in that module.
Now, the issue is that this function in taxonomy has no way to reset its cache. Perhaps we should have a patch to add one, a la node_load(). Or preferably, when node_load(x, x, TRUE) is called, it would also flush the static cache of taxonomy as well.
Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.
I've been working on a
I've been working on a similar processing system, but only for 25,000 user profile nodes that update each week.
I too ran into the memory issue and opted for a system which first splits up the 25,000 row CSV file into chunk files of 1000 rows each. I then use drupal_http_request() to call a custom URL in my module, which processes a specified chunk file. This ensures that after processing 1000 drupal_execute() calls the PHP script ends and I can start the next call with a fresh bit of memory.
My problem is that drupal_http_request() is called for 25 chunk files in total (taking about 1.5 hours). However, the PHP script calling drupal_http_request() in a loop finishes sooner than that and leaves MySQL drowning in queries to perform. Is there a way to stop this?
All I really need is to know how to detect when all the MySQL queries are complete, so that I can re-enable the import form again. I lock it after an import starts to prevent duplicate imports being started and killing the server.
Similar issues with node_save()
Similar issues can arise with repeated calls to node_save() within the same request.
Static caching by other modules responding to node_invoke_nodeapi() is an issue in my case.
As part of my node processing I need to do some simple updates that do not need to issue node_invoke_nodeapi().
So I created a local copy of node_save() and disabled the calls to node_invoke_nodeapi().
This non invoking node_save() variant performed about twice as quickly and used approx half the memory too.
Brilliant hint!
I had the saving problem, too. Every saved node consumed about 300 KB of memory. I commented out the invoke stuff in the node_save() function, and there is absolutely no memory loss, plus the script runs like 10 times faster...
A few solutions I've come up
A few solutions I've come up with
<?php
/**
 * Temporarily disable module hooks on a per-script basis
 * Useful because it doesn't disable modules in system table
 * @param array $modules
 * @param array $hooks
 */
function custom_disable_modules_hooks($modules, $hooks) {
  $orig_modules = $list = module_list();
  foreach ($modules as $module) {
    unset($list[$module]);
  }
  // Reset module list
  module_list(FALSE, FALSE, TRUE, $list);
  // Reset implementations
  module_implements('', FALSE, TRUE);
  // Add each hook to implementations, which will call our modified module list
  foreach ($hooks as $hook) {
    module_implements($hook);
  }
  // Put module list back to normal, so other hooks work normally
  module_list(FALSE, FALSE, TRUE, $orig_modules);
}
?>
And you can call this function like this:
<?php
$mods_disable = array('taxonomy', 'path', 'pathauto', 'rules', 'og');
$hooks = array('nodeapi');
custom_disable_modules_hooks($mods_disable, $hooks);
?>
So far the modules I've found that increase memory in my imports are taxonomy, path, pathauto, rules and og. Without some of these modules, my imports run a hell of a lot faster too.
I narrowed it down to the
I narrowed it down to the token module, via pathauto (this line $tokens['global']['default'] = module_invoke_all('token_values', 'global'); in token_get_values() which is called by pathauto_nodeapi).
Commenting out this stops memory usage jumping by large amounts for every node I save.
I'm not going to delve any further. I can disable pathauto while the node creation process runs and create my own aliases.
Issue I hit
http://drupal.org/node/202319#comment-2021584
This post has been very
This post has been very useful. I have a PostNuke site with hundreds of thousands of nodes to create under Drupal during migration. It's all driven from a shell script, which I'd changed to batch things up. All well and good for 10,000 locations, but when it came to close to half a million forum posts and (node_comment) replies I decided to revisit the issue, which is when I found this thread.
I changed node.module, node_invoke_nodeapi to the following to highlight where all the memory was being eaten. My changes are not indented.
function node_invoke_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
static $m=0;
  $return = array();
  foreach (module_implements('nodeapi') as $name) {
    $function = $name .'_nodeapi';
$b=memory_get_usage();
$c=$b-$m;
echo "\nPre-change of ".number_format($c)."\n";
echo "Memory before $op $function ".number_format($b)."\n";
    $result = $function($node, $op, $a3, $a4);
$a=memory_get_usage();
echo "Memory after $op $function ".number_format($a)."\n";
$c=$a-$b;
echo "Change $op $function ".number_format($c)."\n";
$m=$a;
    if (isset($result) && is_array($result)) {
      $return = array_merge($return, $result);
    }
    else if (isset($result)) {
      $return[] = $result;
    }
  }
  return $return;
}
Seems the main culprit is op:insert function:pathauto_nodeapi which is using between 250 and 300K per node_save.
Now I just need to decide what to do about it.
My guess is that it's the
My guess is that it's the transliteration that causes the high memory usage. Is this D6 or D7? Is it reasonable for you to forego transliteration of pathnames?
--
Dave Hansen-Lange
Director of Technical Strategy, Advomatic.com
Pronouns: he/him/his
I would just disable pathauto
I would just disable pathauto while importing (which also makes it way faster and is, I think, even proposed in the pathauto docs for importing), and then just let pathauto batch-create the aliases later.
You can also set $node->pathauto_perform_alias = FALSE before the node_save to reach this goal without disabling the module.
That should work around the problem at least ...
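As a rough sketch of the second option: the flag name comes from the suggestion above, while import_save_without_alias() and its injectable $save parameter are invented here so the snippet can also run outside Drupal.

```php
<?php
/**
 * Save a node with pathauto aliasing switched off for this save only.
 * Hypothetical wrapper; $save defaults to Drupal's node_save() but can
 * be replaced with any callable for testing outside Drupal.
 */
function import_save_without_alias($node, $save = 'node_save') {
  // Flag named in the reply above; tells pathauto to skip this node.
  $node->pathauto_perform_alias = FALSE;
  // Direct variable-function call, so a by-reference parameter
  // (as node_save() has) is passed correctly.
  $save($node);
  return $node;
}
```

Aliases skipped this way would then be created afterwards in bulk.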
Best Wishes,
Fabian
This is D6. The call in
This is D6. The call in pathauto.module's pathauto_nodeapi that causes the issue is pathauto_get_placeholders (line 527 in v6.x-1.5 of pathauto). I haven't dug into that function at all.
Fabianx's solution resolved the issue for me. Thank you very much!
Testing with 327 nodes, this solution reduced memory usage from 100M and a runtime of 124 seconds, down to 7M and 30 seconds. Double bonus!
To bulk update from my script I just run node_pathauto_bulkupdate() after I've imported each node type. In my test case above, this uses up 22M, so nothing like the extra 93M it was using. I'll probably create a single call from my script to do them all once the import is complete.
Read this for how: http://drupal.org/node/236304
I used
echo "\nBulk alias updates...";
module_load_include('inc', 'pathauto', 'pathauto');
module_load_include('inc', 'pathauto', 'pathauto_node');
node_pathauto_bulkupdate();
You may also want to use forum_pathauto_bulkupdate(), user_pathauto_bulkupdate() and any other *_pathauto_bulkupdate() functions required by any modules you're using.
Ah, spanner meet
Ah, spanner meet works.
node_pathauto_bulkupdate will only update 50 nodes at a time (by default). I need to update half a million nodes! I ended up calling my own version of that function from my migration code and changed it to return TRUE when the $count of nodes updated is less than the 50 limit, otherwise FALSE. That way I can keep calling it in a loop. But of course it will eventually run out of memory for the same reason as node_save did. I think I'll just keep calling it manually from the command line until all the aliases are generated. I can't see an easier (as in, "less hassle") solution really.
Hi, That really is what batch
Hi,
That really is what batch API is for:
http://drupal.org/node/180528
That will do a new page request for each batch and as such not run into the out-of-memory problem ...
Best Wishes,
Fabian
To follow up a little on
To follow up a little on that, memory usage is close to 0 with forum/node_comment nodes. I expect the memory usage for the 327 "place" nodes above was caused by the Location module (at least in part) as I noticed that was using 10k per node too. So... even better news.