Coming from this issue on d.o: Evaluate third party libraries to replace drupal_http_request() and the browser in DrupalWebTestCase
Definitions are listed in order of importance (most important first).
Parallel Requests: Project uses curl_multi_select(), stream_select(), or socket_select() when issuing multiple requests.
Non Blocking Requests: Open connection; write to it; do not wait to read, instead close connection.
Callback: Run custom code once that connection is done; call_user_func() or call_user_func_array().
PSR-0: Does the project follow the PSR-0 standard?
Symfony compatible objects: Request/response objects that are compatible with the Symfony ones.
Pool/Throttle: Limit domain & total number of connections. If issuing 10k requests don't do them all at the same time.
Proxy Support: Can requests be tunneled through a proxy?
Cookie Parsing: Are cookies pulled out of the header?
Global Timeout: Max number of seconds the whole call can take (only matters for parallel).
Async Connection: When opening a connection, do not block.
Complex SSL Logic: Verify peer (Require verification of SSL certificate used) & use of local certificates (Certificate Authority file).
Send Files: The ability to "upload" a file to a server (shows up in $_FILES).
FTP Connection: Get a file off of an FTP server.
Full HTTP 1.1 compliance: Follows all requirements of an HTTP 1.1 client.
Auto Encode Array Data: http_build_query() used on data structures.
Alter streams mid execution: Example - request 20 URLs & break after at least 5 return.
Set Read/Write Chunk Size: I needed to adjust the write chunk size when sending a lot of data to an IIS server.
Persistent connections: Are connections reused between requests.
Streaming bodies: Can the entity body of a request or response be streamed or does it need to be loaded in its entirety into a string.
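As a concrete illustration of the "Auto Encode Array Data" item above, http_build_query() flattens a (possibly nested) PHP array into a URL-encoded query string:

```php
<?php
// http_build_query() turns a (possibly nested) array into a
// URL-encoded query string, bracketed keys included.
$data = ['name' => 'drupal', 'tags' => ['http', 'client']];
echo http_build_query($data);
// → name=drupal&tags%5B0%5D=http&tags%5B1%5D=client
```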
Some other things to consider are test code, documentation, and example use cases. These are a little bit harder to compare, though. The reason I have Parallel, Non Blocking, and Callbacks at the top is for building a multi-process library on top of HTTP requests. Without these 3, the power of a multi-process library is significantly reduced. One more thing to consider is that all of the GitHub projects require cURL or will eventually require it.
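For reference, the "Parallel Requests" pattern the table tests for can be sketched with curl_multi_exec() and curl_multi_select(). This is a minimal illustration, not code from any of the listed projects; the URLs and options are placeholders:

```php
<?php
// Minimal sketch of parallel HTTP requests using the curl multi API.
function fetch_parallel(array $urls) {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    $running = 0;
    do {
        curl_multi_exec($mh, $running);
        if ($running) {
            // Block until there is activity on any of the handles.
            curl_multi_select($mh, 1.0);
        }
    } while ($running > 0);
    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = [
            'body'  => curl_multi_getcontent($ch),
            'error' => curl_error($ch),
        ];
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

Each handle's result (body or error message) comes back keyed by URL once every transfer has finished, which is where the "Callback" and "Pool/Throttle" features would hook in.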
A comparison of all 5 GitHub projects & HTTPRL follows:
| Name | Buzz | Buzy | Guzzle | php-multi-curl | Rolling-Curl | HTTPRL |
|------|------|------|--------|----------------|--------------|--------|
| Parallel Requests | Yes | No | Yes | Yes | Yes | Yes |
| Non Blocking Requests | Unknown | Unknown | Yes | Yes | Yes[4] | Yes |
| Callback | Yes | Yes | Yes | No | Yes | Yes |
| PSR-0 | Yes | Yes | Yes | No | No | No |
| Symfony | No | Yes | No | No | No | No |
| Pool/Throttle | No | No | Yes | No | Yes | Yes |
| Proxy | No | No | Yes | No | Yes[4] | Yes |
| Cookie | Yes | No/Maybe | Yes | No | No | Yes |
| Global Timeout | Yes | No | Yes | No | Yes | Yes |
| Async Connection | No | No | No | Unknown | No | Half Yes[1] |
| Complex SSL Logic | Half Yes[2] | No | Yes | No | Yes[4] | Yes |
| Send Files | Yes | No | Yes | No | Yes[4] - with @ in CURL POST Options | No |
| FTP | No | No | No | Unknown | No | No |
| HTTP 1.1 | Yes | Yes | Yes | Yes | Yes | Almost[3] |
| Encode Array | Yes | No | Yes | No | Yes[4] | Yes |
| Alter Streams | Unknown | No | Yes | No | No | Yes |
| Chunk Size | Unknown | Unknown | Yes | Unknown | Yes[4] | Yes |
| License | MIT | MIT | MIT | MIT? | Apache | GPL |
| Persistent Connections | No | No | Yes | Unknown | No | No |
| Streaming Bodies | No | No | Yes | Yes | No | Yes |
[1] - Blocks on DNS lookups
[2] - Cannot set CURLOPT_CAINFO
[3] - Doesn't handle "100 Continue" correctly
[4] - Can be done by passing curl options directly in the request
Comments
Guzzle / alter streams?
When would someone use the "alter streams mid execution" functionality? I can't think of a use case. Either way, I think it can be done in Guzzle by plugging into the Symfony2 event dispatcher plugin system.
I updated the matrix with some of Guzzle's features. Some of the functionality is provided by curl (e.g. proxies, SSL verification, chunk size).
You might be interested in the plugins offered by Guzzle when making your evaluation (HTTP based caching, cookies, exponential backoff, MD5 hash validation, Oauth 1.0, over the wire logging, history, batching, and mocking): https://github.com/guzzle/guzzle/tree/master/src/Guzzle/Http/Plugin
If you have any questions about Guzzle, please let me know.
-Michael
alter streams mid execution
Idea for this feature came after writing this: http://groups.drupal.org/node/230698#comment-753618
Use cases:
- ESI/SSI page assembly in PHP. Have a list of required ESI resources that need to be delivered to the browser; let the browser get the rest (the slow ones) via AJAX calls. http://drupal.org/project/esi has an AJAX fallback, so something like this isn't a super crazy idea.
- Use different proxies to get content.
- Abort rest of requests if we need a 200 from all URLs to continue.
- A multi-process library could have certain things that need to be returned, and other things that are not necessary. Also, if one of the necessary components errors out, abort the rest of the requests.
Nice!
Awesome, mikeytown, thanks!
Another factor: What's the license on those libraries? (MIT, BSD, LGPL, or GPL are a hard requirement.)
At first glance Guzzle and HTTPRL look like the strongest contenders, at least on paper.
About the only thing Buzy has going for it is that it uses the Symfony2 APIs, which would be good for DX but probably not enough to offset its relatively limited feature set.
HTTPRL is all procedural code, and with some frighteningly long functions. That makes it harder to test, and harder to leverage effectively since we cannot autoload it. It would have to be heavily refactored before it would be D8-ready.
php-multi-curl seems to mix singletons, OO, and procedural cURL. That's a bad sign in my book and makes me run in terror.
Other thoughts?
Thoughts
One thing to remember is I wrote HTTPRL, so the comparison table will have some bias towards it. It's a wiki so help make the comparison more equal if anyone sees unnecessary bias :)
I've been slowly breaking up the long functions into smaller bits of code: http://drupal.org/node/1325662#comment-6023544 - That issue is trying to tackle DNS lookups being blocking... so far it has 2 hooks so you can change the connection from hostname to IP if you already know the IP for the host, or if you wish to distribute threads to different boxes, etc.
I do have an issue to change HTTPRL from D7 to D8 code: http://drupal.org/node/1593862 - but why re-invent the wheel (unless we have to)?
All of the GitHub projects are MIT or Apache.
No Apache
Apache license isn't acceptable unless Drupal moves to GPLv3. I think we should, frankly, but at the moment we have not done so. If any are Apache licensed that needs to be clearly noted.
I don't claim to be an expert in any of the mentioned libraries so I am not the right person to correct for author bias. We should get more authors here than just Guzzle's. ;-) (Hi Michael! Thanks for stopping by!)
Apache may not be
But the MIT license is compatible with the GPLv2. That is the benefit of dual licensing it.
Guzzle is incredibly
Guzzle is incredibly flexible, simple to use, and works well with Symfony components, even if it isn't built on Symfony (see Goutte). It's definitely got my vote.
Guzzle or HTTPRL
If the feature table is right, I also think that Guzzle or HTTPRL are the best solutions. I just learned that "Complex SSL Logic" is possible and I like this idea very much. But more important is the proxy support, I think.
Thanks to mikeytown2 for bringing HTTPRL already to drupal contrib.
--
My company: Nodegard GmbH
Persistent connections
I think another valuable metric not represented here is persistent connection management. This is something that Guzzle handles for both serial and parallel requests. Maybe the matrix should be updated with this data?
Default with cURL
All cURL connections are reused by default :)
The one thing that cURL can't do, from what I've read, is async HTTP. Some people report that it might be possible, so this needs more testing. If cURL cannot do async, I recommend using stream_socket_client() with stream_select(); this code path has been thoroughly tested in AdvAgg and HTTPRL.
mtdowling: Would you be willing to get async functionality into Guzzle? And if curl can't do it, would you accept a minimal amount of HTTPRL-like code in Guzzle to get it working?
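For reference, the stream_socket_client()/stream_select() approach can be sketched like this. This is a minimal illustration, not HTTPRL's actual code; note that the hostname lookup itself still blocks, which is the "Half Yes" footnote in the table:

```php
<?php
// Minimal sketch: open a TCP connection without blocking on connect,
// then wait for writability with stream_select().
function connect_async($host, $port, $timeout = 5) {
    $stream = @stream_socket_client(
        "tcp://$host:$port", $errno, $errstr, 0,
        STREAM_CLIENT_CONNECT | STREAM_CLIENT_ASYNC_CONNECT
    );
    if ($stream === false) {
        return false;
    }
    stream_set_blocking($stream, false);
    $read = $except = [];
    $write = [$stream];
    // The socket becomes writable once the TCP handshake completes.
    if (stream_select($read, $write, $except, $timeout) > 0) {
        return $stream;
    }
    fclose($stream);
    return false;
}
```

A production version would also verify the handshake actually succeeded (e.g. via stream_socket_get_name()) rather than treating writability alone as success.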
That's true, but...
Yes, if you reuse a curl handle (correctly), you will benefit from persistent connections. Not every library reuses a curl handle or attempts to pool them. You need to reuse a handle and ensure that it is not polluted by options that can never be unset (e.g. range headers and timeouts). When a handle is polluted with one of these options, you should throw it away and create a new curl handle. Further, connection caches are not reused between multi and easy handles.
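A minimal sketch of the reuse-and-discard pattern described above (the URLs are placeholders, not a real endpoint):

```php
<?php
// Reusing one curl handle lets curl keep the underlying TCP
// connection alive between requests to the same host.
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
foreach (['http://example.com/a', 'http://example.com/b'] as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $body = curl_exec($ch);
}
// If the handle gets "polluted" by an option that cannot be unset
// (e.g. CURLOPT_RANGE), throw it away and start over:
curl_close($ch);
$ch = curl_init();
```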
There are several potential ways to get curl to bail on a response. I think a plugin could be created for this to enable async PUT/POST requests that do not wait until a response is received. This could possibly work for non-entity-enclosing requests (e.g. GET), but I don't see the point.
Basically, you would tap into the event fired from the curl progress callback:
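In plain PHP, a hedged sketch of that idea (not Guzzle's actual plugin code; the URL is hypothetical) looks roughly like:

```php
<?php
// Abort a transfer from curl's progress callback once the request
// body has been fully uploaded (callback signature per PHP >= 5.5).
$ch = curl_init('http://example.com/upload'); // hypothetical URL
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'payload=1');
curl_setopt($ch, CURLOPT_NOPROGRESS, false);
curl_setopt($ch, CURLOPT_PROGRESSFUNCTION,
    function ($ch, $dlTotal, $dlNow, $ulTotal, $ulNow) {
        // A non-zero return value tells curl to abort the transfer,
        // so we stop as soon as the upload has finished.
        return ($ulTotal > 0 && $ulNow >= $ulTotal) ? 1 : 0;
    }
);
```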
I've tested this, and it appears to work with requests with both Content-Length and chunked Transfer-Encoding. I would just need to create the plugin.
What is the use case for "asynchronous" HTTP calls (seems like that's the wrong word for this)?
use case for asynchronous HTTP calls
There are a lot of use cases for async HTTP. Generation of CSS/JS Aggregates, Imagecache Presets, anything else that creates a file when a special URL is hit. When you put a threading library on top of HTTP then things like flushing/regenerating a cache in the background is possible, sending off emails, doing expensive database operations, various cron tasks, etc... it opens up the doors to a lot of different possibilities.
I actually built HTTPRL after having coded up async code in 3 different modules at one point or another (boost, advagg, imageinfo_cache), and I got tired of doing it again and again and decided to make a library out of it. There is a reason async is #2 on my priority list; it is an extremely powerful tool.
Edit: As for the terminology behind this, I call it non blocking in HTTPRL (I didn't like async either).
Job queue system
I think a job queue system like Gearman is a more appropriate solution for this sort of thing, but this is possible in curl using the plugin method I described.
shared hosting
I don't think Gearman works on most cheap shared hosting accounts... that's the issue, in short.
The other thing is that for CSS/JS aggregates and imagecache presets the timing needs to be now; 10 seconds later is too late for these 2 use cases.
Gearman is a pain to compile
Gearman is a pain to compile on RHEL/CentOS (boost141 is problematic), and the RPMs aren't up to date. Also, it doesn't work on Windows.
RabbitMQ or ActiveMQ might be better alternatives.
Uh
That does not work in core. We learned our lesson already: core can not issue HTTP requests to the same server. It fails in interesting ways on various hosts.
Curious
I'm not used to this message board yet, so I don't know who you're replying to. If you're saying that core can't use persistent HTTP connections when sending requests, then I'm curious why you think that.
Persistent connection handling in curl is pretty bullet-proof. It seems like a good thing to keep enabled by default, but allow users who are working with servers that may have known issues with persistent connections to disable them (which is possible in Guzzle). I'd be happy to test persistent connection handling with any example servers you can provide to see if curl can handle it by default (maybe it falls back to HTTP/1.1 for example).
There's background in this
There's background in this issue: http://drupal.org/node/965078 as well as http://drupal.org/node/245990
The short version is that Drupal used to (actually still does, since that patch isn't committed yet) test its own ability to make HTTP requests in general by making HTTP requests to itself. However, that's disallowed on some servers; on others it will work while external requests won't; still other hosts were restricted to a single thread, so it was impossible for the request doing the testing to ever complete; etc. I don't think this means we can't ever do this (otherwise we couldn't have the simpletest module in core), but it does mean core shouldn't make those requests by default as it currently does.
Async requests are now in Guzzle
This is probably about as close as it will get to truly asynchronous/non-blocking with curl: https://github.com/guzzle/guzzle/commit/6dfd3d280949721342655e08b1ddef88.... It's still in a branch as I'd like to test it a little more and get feedback.
When all of the data that should be uploaded is sent, the plugin I created sets a 1ms timeout on the request transfer and tells curl not to download the body of the response. An X-Guzzle-Async header is added to the response of the request when it completes or times out. If the connection to the server is extremely fast and completes in <1ms, then the request will actually receive a response from the server, but curl doesn't download the response body due to the addition of a CURLOPT_NOBODY option in the progress function.
What do you think?
Thanks!
Thanks for taking the time to implement this feature request :)
Why third parties?
Why run outside the Symfony2 framework? BrowserKit + DomCrawler + PHPUnit could do all the stuff without anything more, unit testing included.
I'm not good at testing but, theoretically, you can write tests for all the features you listed and get back red/green results.
Next phase
The next phase (starting at the beginning of next month) will see how the projects work with oddball URLs that hass gave to me in the HTTPRL issue queue. He has a nice selection due to being the maintainer of Link checker.
List in no particular order
// Redirects to a HOST that does not exist.
'http://www.bma.bund.de/index.cfm?8AC792DB077C4AB5BDB675A52577F0BE'
// multi level redirect.
'http://www.apple.com/qtactivex/qtplugin.cab'
// Sensitive to HTTP version number not being a float
'http://www.technikmuseen.de/'
'http://www.neuland.com/sucht.htm'
// Should give a connection refused fairly quickly.
'http://www.jugendarbeit.gmxhome.de'
// Self signed cert
'https://www.fh-muenster.de/FB10/weiterbildung.htm'
// Issues with async connect
'http://www.profamilia.de/'
// More URLs that have odd behavior.
'http://www.fh-muenster.de/FB10/weiterbildung.htm'
'http://www.science-tech.nmstc.ca/english/index.cfm'
'http://www.aviation.nmstc.ca/Eng/english_home.html'
'http://home.sunrise.ch/kleinera/tourismus-museum/'
'http://www.paritaet.org/gv/infothek/pid/'
'http://www.fabienne-iaf.de'
'http://www.jugendarbeit.gmxhome.de'
'http://www.fr-aktuell.de/uebersicht/alle_serien/regional/berufsbilder'
'http://infomed.mds-ev.de/sindbad_frame'
'http://www.careandhealth.com/'
'http://comm-org.utoledo.edu'
'http://www.zeva.uni-hannover.de'
'http://www2.imj.org.il/eng/branches/rockefeller/final/index.html'
Hopefully some of the raw error numbers match with the numbers here, but this is mainly testing cURL at this point. http://msdn.microsoft.com/en-us/library/aa924071.aspx
One I particularly hate
You can easily generate one yourself: a page where the Content-Length header does not match the length of the data being sent.
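For anyone wanting to reproduce it, a test page like this (a hedged sketch; drop it on any PHP-enabled server) advertises more bytes than it sends:

```php
<?php
// Claim 1000 bytes but send far fewer; a strict client will hang
// or error out while waiting for the rest of the body.
header('Content-Length: 1000');
echo 'only a few bytes';
// The script ends here, leaving the promised bytes unsent.
```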
Added information about rolling curl
As the creator of the github fork, I quickly added the information for rolling curl to the table.
It turns out that many things can be done by passing cURL options directly in the request.
The biggest shortcomings of Rolling-Curl are the missing PSR-0 support and the lack of persistent HTTP connections.
Best Wishes,
Fabian
I've been poking at Guzzle
I've been poking at Guzzle some since a tweet about Goutte brought it to my attention. Good looking code, very thorough.
It's pretty trivial to create Drupal integration plugins for logging and caching. It's kind of noticeable, compared to the plugins in the Guzzle repository, that we are neither PSR-0 nor object-oriented ;)
I thought about providing a DX assessment, but this was also my first heavy brush with the implications of PSR-0 and Symfony's Event component. My evaluation focus was on using it to build a REST API client library, for which it seems well-suited to providing capabilities while staying out of the way.
Been Busy
Hey all, I've been a little busy with work and life, so I haven't had a chance to take these different libraries for a test drive. If someone else wants to do a comparison of the different libraries, that would be helpful. On paper I would vote for Guzzle, but I haven't put it through its paces yet (so I can't fully recommend it). If using Guzzle, we also need to look into Goutte.
I'm pretty sure Guzzle would
I'm pretty sure Guzzle would qualify for at least Half Async, but I haven't tested whether it blocks on DNS or not...
http://guzzlephp.org/api/class-Guzzle.Http.Plugin.AsyncPlugin.html
https://github.com/guzzle/guzzle/pull/62
https://github.com/guzzle/guzzle/blob/master/src/Guzzle/Http/Plugin/Asyn...
FWIW, I'm using it in a payment gateway library and pretty happy with it.
mtdowling added that in
http://groups.drupal.org/node/233173#comment-759158
I called this "Non Blocking Requests"
What I'm talking about in that grid (Async Connection) has to do with getting the IP address from DNS and opening the TCP connection. HTTPRL will open the TCP connection in the background; thus it's half async. The DNS issue is a hard one; even Google is having trouble solving it. If an HTTP client were all I was working on, I could get DNS lookups to be non-blocking as well, but that would require code similar to HTTPRL's, only for the DNS protocol. Take this project http://code.google.com/p/netdns2/ and make it use stream_select() and the DNS lookup would be non-blocking. But creating the ultimate HTTP client on top of stream_select() is not my day job. The TCP connection issue is buried inside of curl and I'm not sure how to enable it (tcp-nodelay?). cURL's async DNS depends on curl compile options: http://stackoverflow.com/questions/6012617/how-enable-curls-asynchdns. Both of these depend on your system and thus cannot be directly controlled from PHP.
In case anyone is wondering the patch to get guzzle in drupal has passed the test bot, so it's just a matter of time until it gets in :)