I've been thinking about a way to make Drupal's page rendering use multiple cores of the server for a single request. Vanilla PHP can create as many child processes as one desires, so the question is how they communicate and where a good place in the code is to make it multi-threaded (MT). My gut says splitting up by region. That would most likely require modifications to these functions:
drupal_render_page
block_page_build
drupal_render
Another thing to keep in mind is that some of the more exotic extensions would make the signaling and forking more efficient (http://php.net/refs.fileprocess.process), but most of these still look too exotic IMHO.
So this is how I could see it being done; feel free to rip this apart, since it means we are at least thinking about utilizing multiple cores per request.
"Traditional Way"
Request comes in; content gets generated and goes to drupal_render_page. Once there, an event loop is created; it spins up a new process for each region, passing along the context (butler) and a unique identifier. The context is used to re-create the original request so that the region is generated correctly. The unique identifier is used as the key for communication via APC's store and fetch; communication could also be done via files, db, memcache, shared memory, etc. The main point is that there has to be a way for 2 processes to talk to each other. The event loop checks the "shared memory" to see whether each region's html has been fully rendered. Once all the pieces come back, the output is flushed to the web server. One could have a timeout here, so that after X time we skip the region and output what we have.
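A rough sketch of that event loop, under loud assumptions: it uses temp files as the "shared memory" (one of the options listed above), the function name render_regions_in_parallel() is made up for illustration, and the child command is only a stand-in for "rebuild the request from the context and render one region". It also assumes the CLI binary (PHP_BINARY) and PHP 7.4+ for the array form of proc_open().

```php
<?php
// Hypothetical sketch: one child process per region, results exchanged
// through temp files keyed by a unique identifier.

function render_regions_in_parallel(array $regions, float $timeout = 5.0): array {
  $uid = uniqid('page_');
  // Stand-in for real region rendering. The child writes to a .tmp file and
  // renames it, so the parent never reads a half-written result.
  $child_code = 'file_put_contents($argv[1] . ".tmp", "<div>" . $argv[2] . "</div>");'
    . 'rename($argv[1] . ".tmp", $argv[1]);';

  $files = array();
  $procs = array();
  foreach ($regions as $region) {
    $files[$region] = sys_get_temp_dir() . '/' . $uid . '_' . md5($region) . '.html';
    $cmd = array(PHP_BINARY, '-r', $child_code, '--', $files[$region], $region);
    $procs[$region] = proc_open($cmd, array(), $pipes);
  }

  // Event loop: poll the shared store until every region is back or we time out.
  $output = array();
  $deadline = microtime(TRUE) + $timeout;
  while (count($output) < count($files) && microtime(TRUE) < $deadline) {
    clearstatcache();
    foreach ($files as $region => $file) {
      if (!isset($output[$region]) && is_file($file)) {
        $output[$region] = file_get_contents($file);
      }
    }
    usleep(10000);
  }

  // Clean up: stop stragglers and remove any files left behind.
  foreach ($procs as $proc) {
    if (is_resource($proc)) {
      proc_terminate($proc);
      proc_close($proc);
    }
  }
  foreach ($files as $file) {
    @unlink($file);
    @unlink($file . '.tmp');
  }
  return $output;
}
```

Regions that miss the deadline are simply absent from the returned array, which matches the "skip this region and output what we have" idea.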
"Facebook Way"
Javascript is good with async operations; let's utilize that to the fullest. Request comes in; context is built (butler) and region placeholders are created. Drupal outputs a bare skeleton of the webpage: some JS, maybe some CSS and an html DOM that has almost zero content, but does have each region AND a unique identifier. We then use ajax to hit the web server again with this unique identifier and a region name; so if we have 15 regions that is 15 non-blocking requests (get the header and main content first for a quick display of what was requested). The UID from the ajax call matches up to a context; the context gets loaded and that region is generated and sent back. Full page caching will be harder to do with this model, as will setting info in the header like meta tags, CSS/JS, etc., but one can see the advantages of going this route. Pages return very quickly, and if one block is acting up, it doesn't slow down the rest of the site.
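The server side of this could look roughly like the sketch below. Everything in it is hypothetical: the function names, the data-* attributes, and the plain array standing in for wherever the context would really be stored.

```php
<?php
// Hypothetical sketch of the skeleton + per-region callback idea.

/**
 * Builds the near-empty skeleton: one placeholder per region, each tagged
 * with the region name and the unique identifier of the stored context.
 */
function build_skeleton(array $regions, string $uid): string {
  $html = '';
  foreach ($regions as $region) {
    $html .= sprintf(
      '<div class="region-placeholder" data-region="%s" data-uid="%s"></div>' . "\n",
      htmlspecialchars($region),
      htmlspecialchars($uid)
    );
  }
  return $html;
}

/**
 * Handles one ajax hit: look up the context by UID, render a single region.
 *
 * Returns NULL when the UID is unknown (expired or forged).
 */
function handle_region_request(array $context_store, string $uid, string $region): ?string {
  if (!isset($context_store[$uid])) {
    return NULL;
  }
  $context = $context_store[$uid];
  // Stand-in for the real region rendering.
  return sprintf(
    '<p>%s for %s</p>',
    htmlspecialchars($region),
    htmlspecialchars($context['path'])
  );
}
```

The JS side would then loop over the placeholders and request each region with its UID, header and main content first.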
Both methods are similar because the child request needs the context and the region. ESI could be used per region as well; doesn't have to be ajax or a php event loop. I can see the services module being the heart of loading up the context & region and outputting the correct info to the requester. Thoughts/Feedback?

Comments
My first thought is that the overall average performance should increase. But if we assume that several pages are being generated concurrently then all the page components could end up being generated by the same core, thus greatly decreasing performance for some small percentage of pages.
--
Dave Hansen-Lange
Director of Technical Strategy, Advomatic.com
Pronouns: he/him/his
Bus lock
At first sight, it looks like this should not give a significant increase because
a) most regions need to be built in sequence, not simultaneously, thus limiting the amount of parallelizing that can be done. Anything contextual in a block, for instance, needs to be built after the context has been built, typically either during the boot sequence or during the "content" building.
b) shared memory to assemble the content means inter-core transfers, which lock the cpu/ram bus during transfers and so hold up the next requests until the slowest "thread" (actually a process, in PHP's case) has finished its transfers. In the normal case, each CPU accesses its own memory bank on a dedicated bus.
As a consequence, this could be slower than a "normal" page build on a multi-user workload, because of that locking mechanism.
You might want to look at how other language communities with more extended multi-thread / concurrent programming support handle this. Search for Ruby GIL, for instance. Even though it is deeper into the VM/interpreter, it is still much the same issue.
David is already doing this
David Strauss has something that implements this, check with him.
Kargo
That's Kargo + libevent - https://wiki.fourkitchens.com/display/TECH/Using+Kargo-Event+with+Ubuntu...
On the 'facebook way', the main issue with ESI and similar approaches is functions like drupal_add_js()/drupal_add_css() that can be called from anywhere and inject stuff into the header.
If all parts of a Drupal page (including the main content as well as blocks etc.) start using #attached and #pre_render I can see that slowly becoming less of a problem - we run the minimum PHP to get that information, then the page could render with ESI keys or however else without having to actually build the full content to find out what they are.
That's why #attached was added, but it is going to take a while for it to come into general use. Getting this bug fixed in HEAD would also help: http://drupal.org/node/859602
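To illustrate why #attached helps here, a minimal sketch assuming Drupal 7 style render arrays. The function names, file paths, and the collector itself are hypothetical (this is not how core processes #attached); the point is only that the header assets can be found by walking the array without running any #pre_render callback.

```php
<?php
// Hypothetical sketch: assets declared via #attached can be discovered
// without rendering the (possibly expensive) content itself.

/**
 * Example page structure; file paths and callback name are made up.
 */
function example_page_build(): array {
  return array(
    'sidebar' => array(
      'block_a' => array(
        '#attached' => array('js' => array('modules/example/a.js')),
        // Never called by collect_attached() below.
        '#pre_render' => array('example_expensive_render'),
      ),
    ),
    'content' => array(
      '#attached' => array('css' => array('modules/example/content.css')),
      '#markup' => '',
    ),
  );
}

/**
 * Walks a render array and gathers every attached JS/CSS file.
 */
function collect_attached(array $element): array {
  $found = array('js' => array(), 'css' => array());
  foreach (array('js', 'css') as $type) {
    foreach ($element['#attached'][$type] ?? array() as $file) {
      $found[$type][] = $file;
    }
  }
  foreach ($element as $key => $child) {
    // Keys starting with "#" are properties; everything else is a child.
    if (is_array($child) && strpos((string) $key, '#') !== 0) {
      $nested = collect_attached($child);
      $found['js'] = array_merge($found['js'], $nested['js']);
      $found['css'] = array_merge($found['css'], $nested['css']);
    }
  }
  return $found;
}
```

A drupal_add_js() call buried inside the render callback would be invisible to a walk like this, which is the problem described above.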
Right.
I think we should rename drupal_add_js() to _drupal_add_js() and force devs to evaluate their usage and try to switch to #attached. Do this after making a top level $response['page']['#attached']. $response presumably comes with Butler project.
What about multiple instances of node.js servers?
Perfect
This is awesome. Sending back skeleton pages and compressed CSS & JS files while drupal is generating the dynamic contents within regions is a great thought. Perhaps these skeleton pages (like the front page, the about us page or a certain node type page) can be cached and served to the client while drupal is busy generating the actual content, which could improve things significantly.
There must be a fallback for what happens if javascript is turned off. It might slow things down a little bit in that case. Doing has_js, access and permission checks early in the bootstrap could reduce any further performance impact when js is turned off. Drupal bootstrap needs to be forked to change the way it currently behaves.
Parallel processing is the key to reducing rendering times in the face of these growing demands.
Parallel is good, but I'm afraid performance can be improved only on highly dynamic sites with pages full of moving junk. For a lot of sites, such as typical news sites, you don't really need to parallelize anything except concurrent requests from different final clients. In this case, the only thing that really works is to have a great reverse cache frontend such as Varnish on top of your Drupal site.
Pierre.
If you are planning to create fully dynamically built pages using AJAX and such, you will lower the server's capacity to accept numerous concurrent final clients. It may also reduce the reverse proxy's ability to cache the atomic pieces of your pages if they depend on a context and you don't have a good, predictable URL generation method. It may lead you to some nice techno stuff such as Comet, but it will make Drupal harder to deploy on small environments.
Facebook is really the wrong example: they actually have their own filesystem to efficiently write and read on hard drives in clusters that I'm afraid most of us won't be able to afford or even see in our lifetime. Most Drupal sites are small to medium sites and don't need such tricks.
Pierre.
I like the SSI/ESI method. Let's look at it from another angle, by using the well known bottlenecks to think about throughput.
You have: CPU, Memory, Disk, and Bandwidth.
CPU: Uncompiled PHP is a problem here, and a low load factor is desired.
Memory: Must load from disk, so it tends to be used as a caching solution.
Disk: Database fsync can cause intensive writes.
Bandwidth: On a 100mbps card, this will limit concurrency if serving full pages.
Actually, without optimizations, the first bottleneck is bandwidth.
If serving uncompressed files of 250k each, this will limit throughput to roughly 400 responses per second (reading 250k as kilobits) before packets can get dropped.
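The arithmetic behind that figure, as a small sketch. The function name is made up, and which reading of "250k" was intended (kilobits or kilobytes) is an assumption either way, so both are shown.

```php
<?php
// Back-of-the-envelope check of the bandwidth ceiling on a 100 Mbit/s link.

/**
 * How many files of a given size fit through the link per second.
 */
function transfers_per_second(int $link_bits_per_second, int $file_bits): float {
  return $link_bits_per_second / $file_bits;
}

$link = 100 * 1000 * 1000;  // 100 Mbit/s

// Reading "250k" as 250 kilobits per file.
echo transfers_per_second($link, 250 * 1000), "\n";      // 400

// Reading "250k" as 250 kilobytes per file (8 bits per byte).
echo transfers_per_second($link, 250 * 1000 * 8), "\n";  // 50
```

Either way, the ceiling ignores protocol overhead, so the real number would be somewhat lower.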
As far as authenticated users are concerned, they should be interacting with the CPU and memory, not the disk, unless committed transactions are absolutely required.
Handlers come before filters generally. This is where SMP can resolve bottlenecks.
Pages generally get served in order like this: html, css (for the end user to notice the page rendering), js, then (non css-embedded) pictures. Since html can be split into divs, html can be made to be less blocking, so skeleton pages are a good idea.
The very last filtering that can be done is probably either compression or marking up some of the files for browser side caching.
Anyway, splitting up the handlers by regions, and filtering by file types makes for distributing the workload across cores.
Anonymous users should generally be served cached disk/memory pages (boost), while for captchas and authenticated users, ajax will be handled by the logical segmentation by regions and file type.
As for the use of the disk for the database, processes marked for filtering (such as compression) can be offloaded from the cpu or php to nginx for processing. Compression can take place asynchronously depending on region and file type.
Regions can be cached, which would result in different handlers. Query information can be preprocessed and sent to memory if using regions, since some metadata can be provided ahead of time.
I don't believe full page caching is harder, as long as everything is indexed properly. The nonblocking method makes better work of multiple cores, since it only handles the logic it needs to.
I agree that atomic caching needs uri consistency. Perhaps use Frame elements, which drupal can apply its own markup to and rebundle as non-Frame elements as needed. I don't know drupal internals that well so far, so some of the api calls mentioned above were lost on me.
You are right about a lot of things in here, but I think that making, for example, every block its own cache entry for ESI may cost more time in HTTP request latency between the ESI gate and the web server than the server would have spent generating the data.
It is not always true that PHP is not compiled: with a good opcode cache, PHP is compiled, and this can speed up code execution by anywhere from 15% to 400%, depending on the code, the use cases and/or the environment.
I think the real question here is more about finding a good caching policy; it's not about ESI. If you have a good caching policy for the "stuff" that is being displayed, it will be good for performance in both cases: one single page creation (classic method) AND a page built with multiple requests (ESI).
If you build the page during one single PHP execution, by fetching 5 or 10 elements from Memcache through a single socket connection, it will probably be way more efficient than letting the reverse proxy cache do 5 or 10 HTTP requests (if it needs to, of course; the whole goal of caching is not to do these requests every time, but with a low cache lifetime this would happen all the time).
So first, a good caching policy must be found. URL or not, I name it a "cache identifier" (which fits both use cases), and the backend URL for the ESI could be fetch/my/. This would benefit both "static" sites and "highly dynamic" sites. If you make static rendering work with this good cache policy, then the ESI handling will only be a matter of writing some tokens in the exact same page instead of rendering it right away (so this is not the complicated technical part). The really complicated technical part is generating those identifiers (and maybe also keeping the context when doing async HTTP requests).
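A sketch of what "one cache identifier, two render modes" might look like. The function names and the /esi-fragment/ path are invented for illustration, the cache is a plain array standing in for Memcache or similar, and hashing the sorted context is just one possible way to build the identifier.

```php
<?php
// Hypothetical sketch: the same cache identifier serves both the classic
// single-request build and the ESI build.

/**
 * Derives a stable identifier from a component name and its context.
 *
 * Sorting the context means the same values in a different order still
 * map to the same cache entry.
 */
function cache_identifier(string $component, array $context): string {
  ksort($context);
  return $component . ':' . hash('sha256', json_encode($context));
}

/**
 * Renders a component inline from cache, or emits an ESI token instead.
 */
function render_component(string $component, array $context, array &$cache, bool $use_esi, callable $build): string {
  $cid = cache_identifier($component, $context);
  if ($use_esi) {
    // Only a token is written; the ESI gate fetches the piece later.
    return '<esi:include src="/esi-fragment/' . rawurlencode($cid) . '"/>';
  }
  if (!isset($cache[$cid])) {
    $cache[$cid] = $build($component, $context);
  }
  return $cache[$cid];
}
```

With this shape, switching a component between inline rendering and ESI is a flag, and the hard part stays where the comment above puts it: deciding what goes into the context.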
EDIT: Some typo.
Pierre.
It's kind of strange to think java is bloated when php users spend memory on opcode caching. Still, php tends to use less memory than the jvm for most sites, although at high concurrency the jvm does better. Development time is better with dynamic languages, as many know. Buffering to fastcgi by nginx doesn't work as intended either, when it is sent to php.
This link is of interest to you, since it links a module that should be useful, even if you don't use nginx: http://groups.drupal.org/node/125094
I like Memcache socket connections as well, for sessions, and more D7 examples of implementing it in a complementary fashion with APC or other caching backends/functions would be nice.
SSI (nginx) / ESI (varnish) blocks loaded via ajax should make up 10-25% of a page after initializing a session, unless you're a gaming site or heavy database usage makes that point moot. Anonymous users wouldn't have to worry about that overhead if using boost, since it provides all the components for static pages. Server admins can use tmpfs for the most frequently used pages, so I tend to favor nginx. I like to remember that the Varnish mailing list has a user who has reported 15k+ rps; I've also read that Varnish can use nginx as a faux CDN, while inline C code can be useful for integrating with memcached. So I don't discount Varnish completely, since it does provide some innovative possibilities.
I totally agree about Varnish; we use it all the time in front of Drupal sites, Plone sites, Django and Zend specific webapps, and it's efficient. But the whole thing is not about making Drupal Varnish-centric but about unifying its caching policy so that anyone can use it simply (including with Varnish as the ESI gate).
One difference with Java is that J2EE is all about container and content. Every piece of software is a component (content) that acts in a bigger container. Every content down to the lowest one is itself a container, and every container up to the highest one is itself a content. An interface is a bean component, a bean can be a servlet component, the servlet can itself be a tomcat component, which can itself be an application server component. Java runs in a VM which, once started, stays up ad vitam aeternam. Objects can be managed in huge common pools, beans live in memory at all times, and memory can be shared globally. Using Java, C#, or some python application servers, you can have a persistent environment. All those VMs have threading facilities that help greatly by making threads fast and efficient through easy-to-use high level APIs.
PHP cannot do this, because it's a scripting language. Even if you manage to keep a PHP daemon running, you will have huge memory leaks because it has never been designed to do this. That's why you have, in Java, some web applications such as Alfresco that will eat 700Mb of RAM at init time, even with no content at all, but that will scale greatly because once every component is awake, you no longer have any latency in the system while accessing them; they share db connections and object memory pools. When you scale PHP, you have to scale the server to accept more clients, but you also often have to scale the database itself because you cannot share connections between threads. Thinking about threading PHP is itself a huge mistake IMHO; it's not widely used, therefore it may lead to serious side effects. The only solution you have is to keep the scripts running really fast to ensure you can accept the next client sooner.
So yes, the JVM probably eats a lot more memory, but a persistent and (almost) stable amount of memory in an environment is probably much more scalable (as long as a single physical box is enough for your needs, and I'm quite sure there are solutions to scale J2EE apps horizontally over more than one physical box).
One other side effect is that because you actually have to init each component on each client hit, you have to bootstrap your application every time, and in Drupal this is probably the most expensive operation today: in Drupal bootstrap you create TCP connections to your memcache and to your database, you do queries, you may even rebuild some cache, do runtime checks on the environment, etc. If you make concurrent requests you will do this heavy operation each time (which is why the Butler project exists, in order to reduce this by spawning specialized, minimal, lightweight bootstrap contexts depending on the nature of the hit: AJAX, web services, frontend page, etc.).
Remember that when you rely on Varnish as an ESI gate you will, in the best scenario, make no request, but in the worst, one hit per cache piece. This will work only if your caching policy is good enough to keep long-lived pieces of cache, else you will bootstrap your web application each time.
So, I would tend to think that rather than thinking about ESI (which is good in fact, because doing this gives you a clearer picture of what those contexts mean, and a better logical separation of the on-screen data in your mind), you really should think about encapsulating all these pieces of content and caching them in an efficient way, so whatever the case, ESI or not, you will benefit from it.
Your ESI zoning will naturally inherit the consistency and solidity of the underlying framework, but the opposite is totally untrue: if the rendering is messy or crappy, then caching or not, everything will remain slow and hard to maintain.
Pierre.
To add/summarize: You have to solve caching (and most importantly: sane cache invalidation) on an api/general level before you start throwing around technology like ESI. ESI needs to be viewed as an implementation of an improved caching/rendering workflow. One that could be replaced with AJAX, BigPipe, etc.
+1
+1 for you :)
I wrote some code for this some days ago, see http://drupal.org/project/esi_api
It renders components based on a cleaned-up series of GET parameters, which are basically context providers. It is able to either include ESI tags or load using AJAX, independently of the nature of the component being rendered.
Cache (in)validation is based on this information, and arbitrary context information can be supplied by the modules' components themselves before rendering or cache fetch, to ensure no outdated cache comes out (components here are procedural APIs because my first use case was blocks).
Pierre.
Should've checked for grammar before hitting submit... "offloaded to the cpu", etc. As far as I know, php5-fpm isn't really multicore, since each individual request stays on the same core; only api calls to other non-php processes, or php forking, will be multi-process.
A PHP runtime is not multicore, that's for sure. I don't know FPM well enough (I haven't tested it), but from the documentation I read it's able to spawn numerous processes for running PHP scripts. That's more than enough to say your overall application will benefit from multiple cores, even if each script remains a single thread.
Pierre.
I think this multi-processing can work out better for CRON-based tasks, where things like caching and indexing for large sites can be done using multiple threads.
rajarju
Cheaply Start Multiple Background Threads via http Request
I've got some ideas cooking on how to use other CPUs (same box or different box) over on this thread here:
http://drupal.org/node/1138098#comment-4483164
The cool thing is I can send out A LOT of http "pings" that could be used to tell any server what to do. I use streams instead of sockets, and I use the stream_select function to do the magic for me; the magic being sending out 1000 concurrent unique http requests to itself (127.0.0.1) in a little over 0.1 seconds. Apache only handled 250 of them due to its configuration, and took some time to actually process them all (I sent requests for pages that would 404 on the drupal box, as these generate a watchdog event).
Anyway, I thought I would update everyone on this latest development. We now have a way to do background processing if both ends have been set up for it.
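For anyone curious what the stream_select part looks like, here is a stripped-down sketch. It is not the code from the linked thread: the function names are made up, the connect is done synchronously for simplicity (fine for 127.0.0.1), and only the waiting for responses is multiplexed.

```php
<?php
// Hypothetical sketch: fire several HTTP requests, then wait on all of them
// at once with stream_select() instead of reading them one by one.

/**
 * Opens one connection per target ("host:port") and sends a GET request.
 */
function start_requests(array $targets, string $path = '/'): array {
  $streams = array();
  foreach ($targets as $key => $host_port) {
    $stream = @stream_socket_client('tcp://' . $host_port, $errno, $errstr, 1.0);
    if ($stream === FALSE) {
      continue;
    }
    fwrite($stream, "GET $path HTTP/1.0\r\nHost: $host_port\r\nConnection: close\r\n\r\n");
    stream_set_blocking($stream, FALSE);
    $streams[$key] = $stream;
  }
  return $streams;
}

/**
 * Collects raw responses from all streams until they close or we time out.
 */
function collect_responses(array $streams, float $timeout = 5.0): array {
  $responses = array_fill_keys(array_keys($streams), '');
  $deadline = microtime(TRUE) + $timeout;
  while ($streams && microtime(TRUE) < $deadline) {
    $read = $streams;
    $write = NULL;
    $except = NULL;
    if (stream_select($read, $write, $except, 0, 200000) === FALSE) {
      break;
    }
    foreach ($read as $stream) {
      // Look the key up ourselves rather than relying on key preservation.
      $key = array_search($stream, $streams, TRUE);
      $chunk = fread($stream, 8192);
      if ($chunk !== FALSE && $chunk !== '') {
        $responses[$key] .= $chunk;
      }
      if (feof($stream)) {
        fclose($stream);
        unset($streams[$key]);
      }
    }
  }
  // Anything still open at the deadline is abandoned.
  foreach ($streams as $stream) {
    fclose($stream);
  }
  return $responses;
}
```

For pure "pings" where the answer doesn't matter, one could skip collect_responses() entirely and just close the streams after writing.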
Followup
I think HTTPRL is ready for wider adoption. It allows for "threading" by using parallel http connections to the server. I've made it easy to use: the syntax is very similar to call_user_func_array(), and returning references makes the code a lot cleaner.
A simple example is to call menu_get_item in parallel over 3 connections. Each httprl_qcinp call will be run in a new drupal process.
<?php
// Queue Callbacks to run In a New Process.
$items = array();
$items[] = &httprl_qcinp('menu_get_item', array('admin/structure'), array('domain_connections' => 3));
$items[] = &httprl_qcinp('menu_get_item', array('admin/content'), array('domain_connections' => 3));
$items[] = &httprl_qcinp('menu_get_item', array('admin/modules'), array('domain_connections' => 3));
$items[] = &httprl_qcinp('menu_get_item', array('admin/config'), array('domain_connections' => 3));
$items[] = &httprl_qcinp('menu_get_item', array('admin/reports'), array('domain_connections' => 3));
// Execute in parallel.
httprl_send_request();
// Echo out results.
echo '<pre>' . print_r($items, TRUE) . '</pre>';
?>
Similar to the above, but using httprl_batch_callback, which runs the callbacks in batches: structure & content get run in the same process, then modules & config, then reports. So 3 processes run a total of 5 commands.
<?php
$menu_items = array(
'admin/structure',
'admin/content',
'admin/modules',
'admin/config',
'admin/reports',
);
$items = httprl_batch_callback('menu_get_item', $menu_items, array('multiple_helper' => TRUE));
// Echo out results.
echo '<pre>' . print_r($items, TRUE) . '</pre>';
?>
If you have a bulk operation then httprl_batch_callback would be useful as well. By default httprl_batch_callback uses 3 threads (threads) with a batch size of 30 (max_batchsize). So the 500 NIDs below will be split up into 17 chunks, and node_load_multiple will be fed 30 NIDs per call with 3 threads running in parallel.
<?php
// List of nodes to load; 1-500.
$nids = range(1, 500);
// Set options.
$options = array(
'global_timeout' => 300,
);
// Queue & Execute requests.
$results = httprl_batch_callback('node_load_multiple', $nids, $options);
// Echo out results.
echo '<pre>' . print_r($results, TRUE) . '</pre>';
?>
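The chunk count can be double-checked with plain PHP. This only illustrates the arithmetic with array_chunk(); it is an assumption that HTTPRL splits the list the same way internally, and chunk_ids() is a made-up helper name.

```php
<?php
// Plain-PHP check of the batch arithmetic: 500 IDs in batches of 30.

/**
 * Splits a list of IDs into batches of at most $batch_size.
 */
function chunk_ids(array $ids, int $batch_size): array {
  return array_chunk($ids, $batch_size);
}

$chunks = chunk_ids(range(1, 500), 30);

echo count($chunks), "\n";       // 17 chunks in total
echo count($chunks[0]), "\n";    // 30 NIDs in a full chunk
echo count(end($chunks)), "\n";  // 20 NIDs in the last, partial chunk
```

So 16 full batches of 30 plus one batch of 20.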
Enjoy :)