Coming from this issue on d.o: Evaluate third party libraries to replace drupal_http_request() and the browser in DrupalWebTestCase
Definitions are listed in order of importance (most important first).
Parallel Requests: Project uses curl_multi_select(), stream_select(), or socket_select() when issuing multiple requests.
Non Blocking Requests: Open connection; write to it; do not wait to read, instead close connection.
Callback: Run custom code once that connection is done; call_user_func() or call_user_func_array().
PSR-0: Does the project follow the PSR-0 standard?
Symfony compatible objects: Request/response objects that are compatible with the Symfony ones.
Pool/Throttle: Limit domain & total number of connections. If issuing 10k requests don't do them all at the same time.
Proxy Support: Can requests be tunneled through a proxy?
Cookie Parsing: Are cookies pulled out of the header?
Global Timeout: Max number of seconds the whole call can take (only matters for parallel).
Async Connection: When opening a connection, do not block.
Complex SSL Logic: Verify peer (Require verification of SSL certificate used) & use of local certificates (Certificate Authority file).
Send Files: The ability to "upload" a file to a server (shows up in $_FILES).
FTP Connection: Get a file off of an FTP server.
Full HTTP 1.1 compliance: Follows all requirements of an HTTP 1.1 client.
Auto Encode Array Data: http_build_query() used on data structures.
Alter streams mid execution: Example - request 20 URLs & break after at least 5 return.
Set Read/Write Chunk Size: I needed to adjust the write chunk size when sending a lot of data to an IIS server.
Persistent connections: Are connections reused between requests.
Streaming bodies: Can the entity body of a request or response be streamed or does it need to be loaded in its entirety into a string.
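As a concrete illustration of the "Auto Encode Array Data" item above, http_build_query() flattens a (possibly nested) PHP array into a URL-encoded query string:

```php
<?php
// http_build_query() turns a (possibly nested) array into a
// URL-encoded query string, bracketed keys included.
$data = ['name' => 'drupal', 'tags' => ['http', 'client']];
echo http_build_query($data);
// → name=drupal&tags%5B0%5D=http&tags%5B1%5D=client
```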
Some other things to consider are test code, documentation, and example use cases. These are a little bit harder to compare, though. The reason I have Parallel, Non Blocking, and Callbacks at the top is for building a multi-process library on top of HTTP requests. Without these 3, the power of a multi-process library is significantly reduced. One more thing to consider is that all of the GitHub projects require cURL or will eventually require it.
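For reference, the "Parallel Requests" pattern the table tests for can be sketched with curl_multi_exec() and curl_multi_select(). This is a minimal illustration, not code from any of the listed projects; the URLs and options are placeholders:

```php
<?php
// Minimal sketch of parallel HTTP requests using the curl multi API.
function fetch_parallel(array $urls) {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    $running = 0;
    do {
        curl_multi_exec($mh, $running);
        if ($running) {
            // Block until there is activity on any of the handles.
            curl_multi_select($mh, 1.0);
        }
    } while ($running > 0);
    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = [
            'body'  => curl_multi_getcontent($ch),
            'error' => curl_error($ch),
        ];
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

Each handle's result (body or error message) comes back keyed by URL once every transfer has finished, which is where the "Callback" and "Pool/Throttle" features would hook in.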
A comparison of all 5 GitHub projects & HTTPRL follows:
| Name | Buzz | Buzy | Guzzle | php-multi-curl | Rolling-Curl | HTTPRL |
|------|------|------|--------|----------------|--------------|--------|
| Parallel Requests | Yes | No | Yes | Yes | Yes | Yes |
| Non Blocking Requests | Unknown | Unknown | Yes | Yes | Yes[4] | Yes |
| Callback | Yes | Yes | Yes | No | Yes | Yes |
| PSR-0 | Yes | Yes | Yes | No | No | No |
| Symfony | No | Yes | No | No | No | No |
| Pool/Throttle | No | No | Yes | No | Yes | Yes |
| Proxy | No | No | Yes | No | Yes[4] | Yes |
| Cookie | Yes | No/Maybe | Yes | No | No | Yes |
| Global Timeout | Yes | No | Yes | No | Yes | Yes |
| Async Connection | No | No | No | Unknown | No | Half Yes[1] |
| Complex SSL Logic | Half Yes[2] | No | Yes | No | Yes[4] | Yes |
| Send Files | Yes | No | Yes | No | Yes[4] - with @ in CURL POST Options | No |
| FTP | No | No | No | Unknown | No | No |
| HTTP 1.1 | Yes | Yes | Yes | Yes | Yes | Almost[3] |
| Encode Array | Yes | No | Yes | No | Yes[4] | Yes |
| Alter Streams | Unknown | No | Yes | No | No | Yes |
| Chunk Size | Unknown | Unknown | Yes | Unknown | Yes[4] | Yes |
| License | MIT | MIT | MIT | MIT? | Apache | GPL |
| Persistent Connections | No | No | Yes | Unknown | No | No |
| Streaming Bodies | No | No | Yes | Yes | No | Yes |
[1] - Blocks on DNS lookups
[2] - Cannot set CURLOPT_CAINFO
[3] - Doesn't handle "100 Continue" correctly
[4] - Can be done by passing curl options directly in the request
Comments
Guzzle / alter streams?
When would someone use the "alter streams mid execution" functionality? I can't think of a use case. Either way, I think it can be done in Guzzle by plugging into the Symfony2 event dispatcher plugin system.
I updated the matrix with some of Guzzle's features. Some of the functionality is provided by curl (e.g. proxies, SSL verification, chunk size).
You might be interested in the plugins offered by Guzzle when making your evaluation (HTTP based caching, cookies, exponential backoff, MD5 hash validation, Oauth 1.0, over the wire logging, history, batching, and mocking): https://github.com/guzzle/guzzle/tree/master/src/Guzzle/Http/Plugin
If you have any questions about Guzzle, please let me know.
-Michael
alter streams mid execution
Idea for this feature came after writing this: http://groups.drupal.org/node/230698#comment-753618
Use cases:
- ESI/SSI page assembly in PHP. Have a list of required ESI resources that need to be delivered to the browser; let the browser get the rest (the slow ones) via AJAX calls. http://drupal.org/project/esi has an AJAX fallback, so something like this isn't a super crazy idea.
- Use different proxies to get content.
- Abort rest of requests if we need a 200 from all URLs to continue.
- A multi-process library could have certain things that need to be returned, and other things that are not necessary. Also, if one of the necessary components errors out, abort the rest of the requests.
Nice!
Awesome, mikeytown, thanks!
Another factor: What's the license on those libraries? (MIT, BSD, LGPL, or GPL are a hard requirement.)
At first glance Guzzle and HTTPRL look like the strongest contenders, at least on paper.
About the only thing Buzy has going for it is that it uses the Symfony2 APIs, which would be good for DX but probably not enough to offset its relatively limited feature set.
HTTPRL is all procedural code, and with some frighteningly long functions. That makes it harder to test, and harder to leverage effectively since we cannot autoload it. It would have to be heavily refactored before it would be D8-ready.
php-multi-curl seems to mix singletons, OO, and procedural cURL. That's a bad sign in my book and makes me run in terror.
Other thoughts?
Thoughts
One thing to remember is I wrote HTTPRL, so the comparison table will have some bias towards it. It's a wiki so help make the comparison more equal if anyone sees unnecessary bias :)
I've been slowly breaking up the long functions into smaller bits of code: http://drupal.org/node/1325662#comment-6023544 - That issue is trying to tackle DNS lookups being blocking... so far it has 2 hooks so you can change the connection from hostname to IP if you already know the IP for the host, or if you wish to distribute threads to different boxes, etc.
I do have an issue to change HTTPRL from D7 to D8 code: http://drupal.org/node/1593862 - but why re-invent the wheel (unless we have to)?
All of the GitHub projects are MIT or Apache.
No Apache
Apache license isn't acceptable unless Drupal moves to GPLv3. I think we should, frankly, but at the moment we have not done so. If any are Apache licensed that needs to be clearly noted.
I don't claim to be an expert in any of the mentioned libraries so I am not the right person to correct for author bias. We should get more authors here than just Guzzle's. ;-) (Hi Michael! Thanks for stopping by!)
Apache may not be
But the MIT license is compatible with the GPLv2. That is the benefit of dual licensing it.
Guzzle is incredibly
Guzzle is incredibly flexible, simple to use, and works well with Symfony components, even if it isn't built on Symfony (see Goutte). It's definitely got my vote.
Guzzle or HTTPRL
If the feature table is right, I also think that Guzzle or HTTPRL are the best solutions. I just learned that "Complex SSL Logic" is possible and I like this idea very much. But more important is the proxy support, I think.
Thanks to mikeytown2 for bringing HTTPRL already to drupal contrib.
--
My company: Nodegard GmbH
Persistent connections
I think another valuable metric not represented here is persistent connection management. This is something that Guzzle handles for both serial and parallel requests. Maybe the matrix should be updated with this data?
Default with cURL
All cURL connections are reused by default :)
The one thing that cURL can't do, from what I've read, is async HTTP. Some people report that it might be possible, so this needs more testing. If cURL cannot do async, I recommend using stream_socket_client() with stream_select(); this code path has been thoroughly tested in AdvAgg and HTTPRL.
mtdowling: Would you be willing to get async functionality into Guzzle? And if curl can't do it, would you accept a minimal amount of HTTPRL-like code in Guzzle to get it working?
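For reference, the stream_socket_client()/stream_select() approach can be sketched like this. This is a minimal illustration, not HTTPRL's actual code; note that the hostname lookup itself still blocks, which is the "Half Yes" footnote in the table:

```php
<?php
// Minimal sketch: open a TCP connection without blocking on connect,
// then wait for writability with stream_select().
function connect_async($host, $port, $timeout = 5) {
    $stream = @stream_socket_client(
        "tcp://$host:$port", $errno, $errstr, 0,
        STREAM_CLIENT_CONNECT | STREAM_CLIENT_ASYNC_CONNECT
    );
    if ($stream === false) {
        return false;
    }
    stream_set_blocking($stream, false);
    $read = $except = [];
    $write = [$stream];
    // The socket becomes writable once the TCP handshake completes.
    if (stream_select($read, $write, $except, $timeout) > 0) {
        return $stream;
    }
    fclose($stream);
    return false;
}
```

A production version would also verify the handshake actually succeeded (e.g. via stream_socket_get_name()) rather than treating writability alone as success.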
That's true, but...
Yes, if you reuse a curl handle (correctly), you will benefit from persistent connections. Not every library reuses a curl handle or attempts to pool them. You need to reuse a handle and ensure that it is not polluted by options that can never be unset (e.g. range headers and timeouts). When a handle is polluted with one of these options, you should throw it away and create a new curl handle. Further, connection caches are not reused between multi and easy handles.
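A minimal sketch of the reuse-and-discard pattern described above (the URLs are placeholders, not a real endpoint):

```php
<?php
// Reusing one curl handle lets curl keep the underlying TCP
// connection alive between requests to the same host.
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
foreach (['http://example.com/a', 'http://example.com/b'] as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $body = curl_exec($ch);
}
// If the handle gets "polluted" by an option that cannot be unset
// (e.g. CURLOPT_RANGE), throw it away and start over:
curl_close($ch);
$ch = curl_init();
```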
There are several potential ways to get curl to bail on a response. I think a plugin could be created for this to enable async PUT/POST requests that do not wait until a response is received. This could possibly work for non-entity-enclosing requests (e.g. GET), but I don't see the point.
Basically, you would tap into the event fired from the curl progress callback:
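In plain PHP, a hedged sketch of that idea (not Guzzle's actual plugin code; the URL is hypothetical) looks roughly like:

```php
<?php
// Abort a transfer from curl's progress callback once the request
// body has been fully uploaded (callback signature per PHP >= 5.5).
$ch = curl_init('http://example.com/upload'); // hypothetical URL
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'payload=1');
curl_setopt($ch, CURLOPT_NOPROGRESS, false);
curl_setopt($ch, CURLOPT_PROGRESSFUNCTION,
    function ($ch, $dlTotal, $dlNow, $ulTotal, $ulNow) {
        // A non-zero return value tells curl to abort the transfer,
        // so we stop as soon as the upload has finished.
        return ($ulTotal > 0 && $ulNow >= $ulTotal) ? 1 : 0;
    }
);
```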
I've tested this, and it appears to work with requests with both Content-Length and chunked Transfer-Encoding. I would just need to create the plugin.
What is the use case for "asynchronous" HTTP calls (seems like that's the wrong word for this)?
use case for asynchronous HTTP calls
There are a lot of use cases for async HTTP. Generation of CSS/JS Aggregates, Imagecache Presets, anything else that creates a file when a special URL is hit. When you put a threading library on top of HTTP then things like flushing/regenerating a cache in the background is possible, sending off emails, doing expensive database operations, various cron tasks, etc... it opens up the doors to a lot of different possibilities.
I actually built HTTPRL after having coded up async code in 3 different modules at one point or another (boost, advagg, imageinfo_cache), and I got tired of doing it again and again and decided to make a library out of it. There is a reason async is #2 on my priority list; it is an extremely powerful tool.
Edit: As for the terminology behind this, I call it non blocking in HTTPRL (I didn't like async either).
Job queue system
I think a job queue system like Gearman is a more appropriate solution for this sort of thing, but this is possible in curl using the plugin method I described.
shared hosting
I don't think Gearman works on most cheap shared hosting accounts... that's the issue, in short.
The other thing is that for CSS/JS aggregates and imagecache presets the timing needs to be now; 10 seconds later is too late for these 2 use cases.
Gearman is a pain to compile
Gearman is a pain to compile on RHEL/CentOS (boost141 is problematic), and the RPMs aren't up to date. Also, it doesn't work on Windows.
RabbitMQ or ActiveMQ might be better alternatives.
Uh
That does not work in core. We learned our lesson already: core can not issue HTTP requests to the same server. It fails in interesting ways on various hosts.
Curious
I'm not used to this message board yet, so I don't know who you're replying to. If you're saying that core can't use persistent HTTP connections when sending requests, then I'm curious why you think that.
Persistent connection handling in curl is pretty bullet-proof. It seems like a good thing to keep enabled by default, but allow users who are working with servers that may have known issues with persistent connections to disable them (which is possible in Guzzle). I'd be happy to test persistent connection handling with any example servers you can provide to see if curl can handle it by default (maybe it falls back to HTTP/1.1 for example).
There's background in this
There's background in this issue: http://drupal.org/node/965078 as well as http://drupal.org/node/245990
The short version is that Drupal used to (actually still does, since that patch isn't committed yet) test its own ability to make HTTP requests in general by making HTTP requests to itself. However, that's disallowed on some servers; on others it will work while external requests won't; still other hosts were restricted to a single thread, so it was impossible for the request doing the testing to ever complete; etc. I don't think this means we can't ever do this (otherwise we couldn't have the simpletest module in core), but it does mean core shouldn't make those requests by default as it currently does.
Async requests are now in Guzzle
This is probably about as close as it will get to truly asynchronous/non-blocking with curl: https://github.com/guzzle/guzzle/commit/6dfd3d280949721342655e08b1ddef88.... It's still in a branch as I'd like to test it a little more and get feedback.
When all of the data that should be uploaded is sent, the plugin I created sets a 1ms timeout on the request transfer and tells curl not to download the body of the response. An X-Guzzle-Async header is added to the response of the request when it completes or times out. If the connection to the server is extremely fast and completes in <1ms, then the request will actually receive a response from the server, but curl doesn't download the response body due to the addition of a CURLOPT_NOBODY option in the progress function.
What do you think?
Thanks!
Thanks for taking the time to implement this feature request :)
Why third parties?
Why run outside the Symfony2 framework? BrowserKit + DomCrawler + PHPUnit could do all the stuff without anything more, unit testing included.
I'm not good at testing but, theoretically, you can write tests for all the features you listed and get back red/green results.
Next phase
The next phase (starting at the beginning of next month) will see how the projects work with oddball URLs that hass gave to me in the HTTPRL issue queue. He has a nice selection due to being the maintainer of Link checker.
List in no particular order
// Redirects to a HOST that does not exist.
'http://www.bma.bund.de/index.cfm?8AC792DB077C4AB5BDB675A52577F0BE'
// multi level redirect.
'http://www.apple.com/qtactivex/qtplugin.cab'
// Sensitive to HTTP version number not being a float
'http://www.technikmuseen.de/'
'http://www.neuland.com/sucht.htm'
// Should give a connection refused fairly quickly.
'http://www.jugendarbeit.gmxhome.de'
// Self signed cert
'https://www.fh-muenster.de/FB10/weiterbildung.htm'
// Issues with async connect
'http://www.profamilia.de/'
// More URLs that have odd behavior.
'http://www.fh-muenster.de/FB10/weiterbildung.htm'
'http://www.science-tech.nmstc.ca/english/index.cfm'
'http://www.aviation.nmstc.ca/Eng/english_home.html'
'http://home.sunrise.ch/kleinera/tourismus-museum/'
'http://www.paritaet.org/gv/infothek/pid/'
'http://www.fabienne-iaf.de'
'http://www.jugendarbeit.gmxhome.de'
'http://www.fr-aktuell.de/uebersicht/alle_serien/regional/berufsbilder'
'http://infomed.mds-ev.de/sindbad_frame'
'http://www.careandhealth.com/'
'http://comm-org.utoledo.edu'
'http://www.zeva.uni-hannover.de'
'http://www2.imj.org.il/eng/branches/rockefeller/final/index.html'
Hopefully some of the raw error numbers match with the numbers here, but this is mainly testing cURL at this point. http://msdn.microsoft.com/en-us/library/aa924071.aspx
One I particularly hate
You can easily generate one yourself: a page where the Content-Length header does not match the length of the data being sent.
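For anyone wanting to reproduce it, a test page like this (a hedged sketch; drop it on any PHP-enabled server) advertises more bytes than it sends:

```php
<?php
// Claim 1000 bytes but send far fewer; a strict client will hang
// or error out while waiting for the rest of the body.
header('Content-Length: 1000');
echo 'only a few bytes';
// The script ends here, leaving the promised bytes unsent.
```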
Added information about rolling curl
As the creator of the github fork, I quickly added the information for rolling curl to the table.
It turns out that many things can be done by passing cURL options directly in the request.
The biggest shortcomings of Rolling-Curl are the missing PSR-0 support and the lack of persistent HTTP connections.
Best Wishes,
Fabian
I've been poking at Guzzle
I've been poking at Guzzle some since a tweet about Goutte brought it to my attention. Good looking code, very thorough.
It's pretty trivial to create Drupal integration plugins for logging and caching. It's kind of noticeable, compared to the plugins in the Guzzle repository, that we are neither PSR-0 nor object-oriented ;)
I thought about providing a DX assessment, but this was also my first heavy brush with the implications of PSR-0 and Symfony's Event component. My evaluation focus was on using it to build a REST API client library, for which it seems well-suited to providing capabilities while staying out of the way.
Been Busy
Hey all, I've been a little busy with work and life, so I haven't had a chance to take these different libraries for a test drive. If someone else wants to do a comparison of the different libraries, that would be helpful. On paper I would vote for Guzzle, but I haven't put it through its paces yet (so I can't fully recommend it). If using Guzzle, we also need to look into Goutte.
I'm pretty sure Guzzle would
I'm pretty sure Guzzle would qualify for at least Half Async, but I haven't tested whether it blocks on DNS or not...
http://guzzlephp.org/api/class-Guzzle.Http.Plugin.AsyncPlugin.html
https://github.com/guzzle/guzzle/pull/62
https://github.com/guzzle/guzzle/blob/master/src/Guzzle/Http/Plugin/Asyn...
FWIW, I'm using it in a payment gateway library and pretty happy with it.
mtdowling added that in
http://groups.drupal.org/node/233173#comment-759158
I called this "Non Blocking Requests"
What I'm talking about in that grid (Async Connection) has to do with getting the IP address from DNS and opening the TCP connection. HTTPRL will open the TCP connection in the background; thus it's half async. The DNS issue is a hard one; even Google is having trouble solving it. If an HTTP client were all I was working on, I could get DNS lookups to be non-blocking as well, but that would require code similar to HTTPRL's, only for the DNS protocol. Take this project http://code.google.com/p/netdns2/ and make it use stream_select() and the DNS lookup would be non-blocking. But creating the ultimate HTTP client on top of stream_select() is not my day job. The TCP connection issue is buried inside of curl and I'm not sure how to enable it (tcp-nodelay?). cURL's async DNS depends on curl compile options: http://stackoverflow.com/questions/6012617/how-enable-curls-asynchdns. Both of these depend on your system and thus cannot be directly controlled from PHP.
In case anyone is wondering the patch to get guzzle in drupal has passed the test bot, so it's just a matter of time until it gets in :)