Using httprl for parallelization

Recent improvements to mikeytown2's httprl module make it a fairly powerful tool for parallelizing certain classes of long-running operations, allowing them to scale horizontally across one or more servers. I found a great use for this when I needed to generate a list of one-time login tokens for all users on one of our sites. The first version worked well, but took several minutes and had to be run as a cron job through drush. For the next iteration, I wanted to generate the list quickly (i.e. in 30 seconds or less) from an admin UI. This turned out to be a great place to experiment with the parallelization changes to httprl, and it resulted in a substantial speedup after a bit of tuning.

At a high level, httprl now lets developers call one or more arbitrary PHP functions in parallel and process the return values. This helps with tasks that involve doing the same thing to a large number of items. In our case, it took a job that had been running in 44 seconds down to 16. There's potential for further improvement by using more than one machine, but that's already a good result. In the rest of this post I'll walk through identifying, parallelizing and tuning a job using httprl. This post assumes that you're using httprl 1.6 or later.
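
The basic pattern is to queue one or more callbacks and then send them all at once. Here's a minimal sketch of that pattern; my_expensive_callback and $some_argument are placeholders for your own code, not part of httprl's API:

<?php
# Minimal sketch of the queue-and-send pattern. my_expensive_callback and
# $some_argument are placeholder names for your own code.
$result = '';
$callback_options = array(
  array(
    # named function to run in a background request
    'function' => 'my_expensive_callback',
    # httprl will place the callback's return value here
    'return' => &$result,
  ),
  # any further elements are passed to the callback as arguments
  $some_argument,
);
httprl_queue_background_callback($callback_options);
# send all queued callbacks in parallel and wait for them to finish
httprl_send_request();
?>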

The code in question started out looking something like this:

<?php
# uids now contains about 17k uids
$uids = get_long_list_of_relevant_uids();
foreach ($uids as $uid) {
  # do some side-effect-free work for a user, spit out the result
  echo join(',', generate_data_for_user($uid)) . "\n";
}
return;
?>
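
The details of generate_data_for_user() don't matter for parallelization; the real function isn't shown here. As a rough sketch of the kind of side-effect-free, one-time-login work involved, it might look something like this, assuming Drupal 7's user_load() and user_pass_reset_url():

<?php
# Rough sketch only; not the actual implementation. Assumes Drupal 7's
# user_load() and user_pass_reset_url().
function generate_data_for_user($uid) {
  $account = user_load($uid);
  return array($account->name, $account->mail, user_pass_reset_url($account));
}
?>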

This got the job done but wasn't particularly speedy. I could theoretically parallelize this as-is by processing each user in a separate thread, but any efficiency gains would be eaten up by the bootstrap phase of httprl's child threads. Threads currently go through the full Drupal bootstrap, so each one needs to do enough work to offset that time. Step 1 in parallelizing this code was to break the work into chunks:

<?php
# break uids into chunks of 500 pieces
$uid_chunks = array_chunk(get_long_list_of_relevant_uids(), 500);
foreach ($uid_chunks as $uid_chunk) {
  # do some side-effect-free work for a bunch of users
  echo process_uid_chunk($uid_chunk);
}
return;

function process_uid_chunk($uid_chunk) {
  $ret = '';
  foreach ($uid_chunk as $uid) {
    $ret .= join(',', generate_data_for_user($uid)) . "\n";
  }
  return $ret;
}
?>
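
One caveat before moving on: httprl's child threads bootstrap Drupal and then invoke the callback by name, so process_uid_chunk() (and anything it calls) needs to live in code that gets loaded during bootstrap, such as a custom module, rather than in a one-off script. A sketch, using a hypothetical mymodule as the module name:

<?php
# mymodule.module -- "mymodule" is a hypothetical custom module; the point is
# that the callback must be defined in code Drupal loads on bootstrap so the
# child request can call it by name.
function process_uid_chunk($uid_chunk) {
  $ret = '';
  foreach ($uid_chunk as $uid) {
    $ret .= join(',', generate_data_for_user($uid)) . "\n";
  }
  return $ret;
}
?>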

Step 2 was to offload all the work to httprl.

<?php
# break uids into chunks of 500 pieces
$return_values = array();
$uid_chunks = array_chunk(get_long_list_of_relevant_uids(), 500);
foreach ($uid_chunks as $chunk_number => $uid_chunk) {
  # the return value will be stored in here, so make sure there's something
  $return_values[$chunk_number] = '';
  $callback_options = array(
    array(
      # this function will be called in parallel
      'function' => 'process_uid_chunk',
      # its return value will be placed here
      'return' => &$return_values[$chunk_number],
      'options' => array(
        'domain_connections' => 8,
        'timeout' => 60,
      ),
    ),
    # pass this as an argument to the callback
    $uid_chunk,
  );
  # queue the callback
  httprl_queue_background_callback($callback_options);
}

# send the queued callbacks; all return values will be placed into $return_values
httprl_send_request();
echo join("\n", $return_values);
return;
?>

And with that, the list is generated in parallel. httprl takes care of details like setting up a menu endpoint, keeping callbacks from being maliciously invoked, and marshalling/unmarshalling arguments into the right places. There are a couple of parameters worth tuning, namely the "domain_connections" value and the total number of threads. The goal is to have just enough threads to keep all CPUs busy: with too many threads you'll waste time bootstrapping unnecessary Drupal instances, and with too few you won't fully utilize the available computing power.
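
In practice the thread count falls out of the chunk size, so one way to tune is to derive the chunk size from the number of workers you want to keep busy. A rough sketch of that idea; pick_chunk_size and the numbers here are illustrative, not part of httprl:

<?php
# Illustrative helper, not part of httprl: choose a chunk size so the number
# of chunks is a small multiple of the workers you want busy, keeping CPUs
# occupied without paying for extra Drupal bootstraps.
function pick_chunk_size(array $uids, $workers = 8, $chunks_per_worker = 2) {
  $target_chunks = max(1, $workers * $chunks_per_worker);
  return (int) ceil(count($uids) / $target_chunks);
}

$uids = get_long_list_of_relevant_uids();
$uid_chunks = array_chunk($uids, pick_chunk_size($uids));
?>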