perusio's config vs Lynx

Posted by Garrett Albright

Lynx is a command-line web browser that is often useful for testing when you need something more functional than curl/wget but don't need a full-on graphical web browser (or perhaps when you want to read a web page without being distracted by pesky graphics, colors and text formatting). However, if you're using perusio's config on your server and you try to access it via Lynx, it won't work; you'll get an "Alert! Unable to access document." error. Frustrating…

If you check the server logs, you'll see something like:

127.0.0.1 - - [19/Jan/2013:16:13:45 +0900] "GET / HTTP/1.0" 444 0 "-" "Lynx/2.8.7rel.2 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/1.0.1c"

Hmm? Status code 444? What the heck is that? Quoth Wikipedia:

444 No Response (Nginx)
Used in Nginx logs to indicate that the server has returned no information to the client and closed the connection (useful as a deterrent for malware).
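If your access log is busier than mine, a short script can pull out just the requests that ended in a 444. This is only a sketch: it assumes the default "combined" log format seen in the line above, and the names LOG_LINE and dropped_requests are my own, not anything Nginx provides.

```python
import re

# Assumes Nginx's default "combined" log format; adjust the pattern if your
# log_format directive differs.
LOG_LINE = re.compile(
    r'^(?P<addr>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def dropped_requests(lines):
    """Yield (request, user agent) for every log line with status 444."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status") == "444":
            yield match.group("request"), match.group("agent")


# The log line from this post, used as sample input.
sample = (
    '127.0.0.1 - - [19/Jan/2013:16:13:45 +0900] "GET / HTTP/1.0" 444 0 '
    '"-" "Lynx/2.8.7rel.2 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/1.0.1c"'
)

for request, agent in dropped_requests([sample]):
    print(request, "<-", agent)
```

To run it against a real log, pass an open file object instead of the one-element list; where that file lives depends on your setup.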

I grepped my config directory for "444" to see if there was something in my config that was triggering this. Sure enough, in the site config files (named something like example.com.conf), you'll see this:

## See the blacklist.conf file at the parent dir: /etc/nginx.
## Deny access based on the User-Agent header.
if ($bad_bot) {
    return 444;
}
## Deny access based on the Referer header.
if ($bad_referer) {
    return 444;
}

And in blacklist.conf:

## Add here all user agents that are to be blocked.
map $http_user_agent $bad_bot {
    default 0;
    libwww-perl                      1;
    ~(?i)(httrack|htmlparser|libwww) 1;
}

…And libwww happens to appear in the Lynx User-Agent string (Lynx/2.8.7rel.2 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/1.0.1c on my box, but YMMV).

(If you're not following along, here's what's happening: blacklist.conf tells Nginx to set $bad_bot to 1 if the browser's User-Agent header string contains "libwww", which Lynx's does. The site config file then tells Nginx to return 444 whenever $bad_bot is 1. As the Wikipedia quote says, 444 is not a real HTTP response; Nginx simply closes the connection without sending anything and records 444 in its log. Lynx gets no response at all, so it shows that error.)
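If it helps to see that logic outside of Nginx, here is a rough Python model of the map and the if block. It is a simplification under my own assumptions: it only covers plain strings and regexes (no hostname-style masks), it compares plain strings case-insensitively as I understand the map module to do, and the names bad_bot and status_for are made up for this sketch.

```python
import re

# Simplified model of the map block in blacklist.conf. Real Nginx also
# supports hostname-style masks, which this sketch ignores.
EXACT = {"libwww-perl": "1"}
REGEXES = [(re.compile(r"(httrack|htmlparser|libwww)", re.IGNORECASE), "1")]
DEFAULT = "0"


def bad_bot(user_agent):
    """Return the value $bad_bot would get for this User-Agent string."""
    # Plain strings are tried first and must match the whole value
    # (compared case-insensitively here).
    key = user_agent.lower()
    if key in EXACT:
        return EXACT[key]
    # Then regexes, in the order they appear; the first match wins.
    for pattern, value in REGEXES:
        if pattern.search(user_agent):
            return value
    return DEFAULT


def status_for(user_agent):
    """Mimic `if ($bad_bot) { return 444; }`; 200 stands in for normal handling."""
    # Nginx treats an empty string or "0" as false in `if`.
    return 444 if bad_bot(user_agent) not in ("", "0") else 200


lynx = "Lynx/2.8.7rel.2 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/1.0.1c"
print(status_for(lynx))           # blocked: "libwww" appears in the string
print(status_for("Mozilla/5.0"))  # allowed: no banned substring
```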

There are a number of ways to fix this. You can edit blacklist.conf and remove libwww from the list of banned strings in user agents (so the above would look like the following instead):

## Add here all user agents that are to be blocked.
map $http_user_agent $bad_bot {
    default 0;
    libwww-perl                      1;
    ~(?i)(httrack|htmlparser) 1;
}

Or you can edit your site config file and comment out or remove the bit which blocks access to "bad bots" (the three lines that begin with if ($bad_bot) {). Either change will theoretically leave your site less protected against malicious bots, but since it's trivial for software to fake a User-Agent header and pretend to be a "legitimate" browser, this alone won't exactly break your server wide open. (I know that if I were writing a malicious bot, I'd just have it use Internet Explorer's or Firefox's User-Agent header; a server is unlikely to be configured to block those.)

As an example of the above point, if altering your server config is not an option, you can also simply have Lynx use a different User-Agent header by using the -useragent flag: lynx -useragent="foo" example.com
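The same trick works from any HTTP client, which is really the point about how weak this kind of blocking is. Here's a minimal sketch using Python's standard urllib; build_request and probe are names I made up, and treating "no usable response" as None is my own choice for illustration.

```python
import http.client
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Build a GET request that announces itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe(url, user_agent, timeout=5):
    """Return the HTTP status, or None if the server gave no usable response."""
    try:
        request = build_request(url, user_agent)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a 2xx/3xx status.
        return err.code
    except (OSError, http.client.HTTPException):
        # Covers a connection closed without any reply, which is what a
        # client sees when Nginx returns 444.
        return None
```

Against a server running this config, I'd expect probe("http://example.com/", "libwww-FM/2.14") to come back as None and probe("http://example.com/", "foo") to come back with a normal status, but try it against your own server rather than taking my word for it.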

Well, this post turned out longer than I intended, but hopefully it provides help to someone else in the future who might be stumped like this.

Comments

perusio:

Well, you could use the full UA string to override the $bad_bot stuff.

map $http_user_agent $bad_bot {
    default 0;
    ~*^Lynx 0;
    libwww-perl                       1;
    ~(?i)(httrack|libwww|htmlparser) 1;
}

I'll push the change to all the branches.

Even though faking a UA header is trivial, there's a lot of low-hanging fruit among spammers and similar fauna who don't even bother to change the UA header.
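A note on why the revised map in this comment works: among regex entries, Nginx picks the first one that matches, in the order they appear in the file, so the ^Lynx exemption has to sit above the libwww pattern. The sketch below models just that ordering; "SomeTool" is a made-up User-Agent, and bad_bot and RULES are names invented for the illustration.

```python
import re

# Regex entries from the revised map, in file order. Among regexes, Nginx
# uses the first one that matches, so the Lynx exemption has to come first.
RULES = [
    (re.compile(r"^Lynx", re.IGNORECASE), "0"),
    (re.compile(r"(httrack|libwww|htmlparser)", re.IGNORECASE), "1"),
]


def bad_bot(user_agent, rules=RULES):
    """Return the first matching rule's value, or the map's default of "0"."""
    # The exact-string entry is checked before any regex.
    if user_agent.lower() == "libwww-perl":
        return "1"
    for pattern, value in rules:
        if pattern.search(user_agent):
            return value
    return "0"


lynx = "Lynx/2.8.7rel.2 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/1.0.1c"
print(bad_bot(lynx))                       # exempted by ^Lynx
print(bad_bot("SomeTool libwww-FM/2.14"))  # hypothetical UA, still caught by libwww
print(bad_bot(lynx, rules=RULES[::-1]))    # with the order reversed, Lynx is blocked again
```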

perusio:

Fixed on all branches. Thanks.
