Node statistics for busy sites with Pixel Ping

mshmsh5000's picture

Once you put a Drupal app behind layers of external cache, the idea of "most popular" or "most viewed" becomes hard to compute. Boost's excellent boost_stats.php callback tries to solve this in some situations by doing a partial bootstrap to access the DB and increment node_counter. But there are at least two potential complications to boost_stats.php with external cache.

  1. If your site lives behind external cache (e.g., Akamai), the callback paths will get cached. You need to update the CDN config so that it doesn't cache this path, or alter the request (e.g., append a random string to the callback) so that it busts through the cache layers and accesses the Drupal app directly.
  2. If your site is behind external cache because it sustains high traffic, the last thing you want to do is to cause a database write with every content (page, story, etc.) page load.

I found some inspiration with the recently-announced Pixel Ping, a lightweight pixel-tracker that runs on node.js. The attraction here is that you can run this service from a different (non-cached) domain, load a pixel with each content page load, and have Pixel Ping flush its stats periodically via a REST callback.

Instead of writing to the database on every page view, you can write every, say, 60 seconds. And instead of dealing with your external cache config, you can run this somewhere else entirely.
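A minimal sketch of that batching idea, written in plain PHP purely for illustration (Pixel Ping itself runs on node.js). The class name HitBuffer and its methods are hypothetical and are not part of Pixel Ping or any Drupal module; timestamps are passed in explicitly so the behaviour is easy to follow.

```php
<?php
// Hypothetical illustration: count hits in memory and hand them over
// as one batch per interval, instead of one database write per view.
class HitBuffer {
  private $hits = array();
  private $lastFlush;
  private $interval;

  public function __construct($interval, $now) {
    $this->interval = $interval;
    $this->lastFlush = $now;
  }

  // Record one page view for a key such as "node,1".
  public function record($key) {
    if (!isset($this->hits[$key])) {
      $this->hits[$key] = 0;
    }
    $this->hits[$key]++;
  }

  // Return the aggregated batch once the interval has elapsed, else NULL.
  public function flushIfDue($now) {
    if ($now - $this->lastFlush < $this->interval) {
      return NULL;
    }
    $batch = $this->hits;
    $this->hits = array();
    $this->lastFlush = $now;
    return $batch;
  }
}

$buffer = new HitBuffer(60, 0);
$buffer->record('node,1');
$buffer->record('node,1');
$buffer->record('node,2');
var_dump($buffer->flushIfDue(30)); // Not due yet: NULL, nothing is written.
print_r($buffer->flushIfDue(60));  // Due: one batch with node,1 => 2 and node,2 => 1.
```

Three page views end up as a single batch, which is the only point at which the database would need to be touched.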

I have the first draft of a pixel_ping module that provides this integration -- a block to load the pixel, and a REST callback to update the node_counter statistics. The callback reuses code from Boost's boost_stats.php, but deals in batches of statistics instead of individual node_loads.
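As a rough sketch of what handling a batch could look like on the Drupal side: the function below is hypothetical (it is not the module's actual code) and assumes the flushed data arrives as a JSON object of key => hit count, with keys such as "node,1" like those in the Pixel Ping output quoted further down this thread. It keeps only node views and returns nid => hits.

```php
<?php
// Hypothetical helper: reduce a flushed batch to nid => hit count.
// Assumption: the payload is a JSON object whose keys look like "node,1".
function example_batch_to_node_hits($json) {
  $batch = json_decode($json, TRUE);
  $hits = array();
  if (!is_array($batch)) {
    // Malformed payload: ignore it rather than touching any counters.
    return $hits;
  }
  foreach ($batch as $key => $count) {
    // Only keys of the form "node,<numeric id>" are node views;
    // paths such as "node,add" are skipped.
    if (preg_match('/^node,(\d+)$/', $key, $matches)) {
      $nid = (int) $matches[1];
      $hits[$nid] = (isset($hits[$nid]) ? $hits[$nid] : 0) + (int) $count;
    }
  }
  return $hits;
}

// Yields 3 hits for node 1 and 1 hit for node 7; "node,add" is dropped.
print_r(example_batch_to_node_hits('{"node,1": 3, "node,add": 2, "node,7": 1}'));
```

Each nid in the result then needs one counter update per flush, however many page views it represents.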

While I work on building out the configuration options for this, and before I submit a module proposal to d.o, I'm wondering whether any of you have caveats, feature requests, or other insight to share.

Comments

access log

mikeytown2's picture

I would offer support for the access log as well. Glad you found my file useful. If I were you, I would make the backend interchangeable/configurable, so other setups that may not want to run node.js can still take advantage of stats without having to install Boost. It's something I've been meaning to do; I've just been too busy with other things. Oh, and there is a slight bit of legacy code in the boost_stats.php file; it could be done slightly better now that I see the end from the beginning.

Good idea.

mshmsh5000's picture

If I go down the road of decoupling the functionality from the node.js solution (which is a good idea, thanks for that), and provide a standalone method a la boost_stats.php, I'm wondering what the most useful solution would be for everyone. A tracker pixel? An AJAX callback like boost_stats.php? Or both?

For the sites I'm working with, there's one of only two situations: there's a CDN in place, in which case you could configure the CDN to ignore a pixel or AJAX callback path; or there's no CDN in place, in which case traffic is sufficiently low that we're not worried about the native node statistics load. So I'm hoping to hear from others what the non-Pixel Ping use cases would look like.

I'll add in optional access log support too. Thanks for the suggestions.

Maybe pixel_ping isn't the best name for this module, if you don't have to use Pixel Ping with it!

Reason for ajax

mikeytown2's picture

The reason there is an AJAX path is so you can return the stats block from core. Now that there are things like
http://drupal.org/project/ajaxify_regions
http://drupal.org/project/ajaxblocks
an AJAX callback is not as necessary. I would opt for the 1px image.

Referrer

mikeytown2's picture

Forgot about getting the referrer... I think you could still pass that in a get parameter of the image. The other thing to think about is the session entry in the accesslog. I'm sure what I have for that could be improved upon.
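A small sketch of passing the referrer along as a GET parameter of the image. The function name, the host, and the parameter names (nid, referrer) are all hypothetical, not what boost_stats.php or Pixel Ping actually expect.

```php
<?php
// Hypothetical helper: build a tracking-pixel URL that carries the node ID
// and, when known, the referrer as URL-encoded GET parameters.
function example_pixel_url($base, $nid, $referrer) {
  $query = array('nid' => (int) $nid);
  if ($referrer !== '') {
    $query['referrer'] = $referrer;
  }
  // http_build_query() takes care of URL-encoding the values.
  return $base . '?' . http_build_query($query);
}

echo example_pixel_url('http://stats.example.com/pixel.gif', 42, 'http://example.com/some page'), "\n";
```

On a page served from cache, PHP does not run per request, so in practice the referrer would probably have to be appended client-side (for example from JavaScript's document.referrer); the encoding concern is the same either way.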

Use Google Analytics API?

Vacilando's picture

I've long thought that the best route for this kind of stats would be to employ Google Analytics. It is lightweight and constantly improved. It gets loaded on each page, whether it is generated, Boost-ed, or served from a CDN. If only the page view count is needed, why not extract it using the Google Analytics Data Export API? No extra JavaScript, database, etc. Just extract the count when needed. There is already a module whose functions could be re-used for that -- http://drupal.org/project/google_analytics_api. What do you think?


---
Tomáš J. Fülöpp
http://twitter.com/vacilandois

Excellent and very much feasible

deepesh's picture

I would second that suggestion. Google Analytics is already loaded and doing the hard work, so why bother wasting computing resources? The only negative effect would be that it won't track users who block GA with blockers like Adblock!

Sometimes its useful to see

heavy_engineer's picture

Sometimes it's useful to see what's happening in real time. Using something like Pixel Ping with the right GET parameters, you could watch various pages in a lovely jQuery spy from a DB that has nothing to do with the live service.

Obviously, if you don't care about real time, then GA is great.

Systems architecture and Drupal development - www.initsix.co.uk

Google Analytics Counter

Vacilando's picture

See Google Analytics Counter -- it works quite nicely, though there is a small problem with respect to access by anonymous users. The node_counter is not updated, but I guess that could be done easily (if anybody shares an outline of the task, I am quite happy to implement it presently). Of course, Google Analytics only registers hits from users who have JavaScript enabled and who did not disable GA tracking.

UPDATE: As of version 6.x-1.2 the module also supports Views display of the GA Counter values for nodes.


---
Tomáš J. Fülöpp
http://twitter.com/vacilandois

Imagine a world...

mshmsh5000's picture

...where you can't just install Google Analytics on every site where you want accurate node_counter info. Imagine, for instance, you work at a big company where there may be some reluctance to get GA running on a particular site because GA data may conflict with other data that's already being gathered via other services, and the resulting conflict may cause headaches.

That world exists. If GA isn't an option, then the GA module isn't an option. That's one issue.

Another issue, which may be solved by the GA data export module, is that you can have an indefinite number of URL variations pointing to the same node. Does this module filter out GET parameters and duplicate path aliases to arrive at consistent data? (And does that require fewer cycles on the Drupal side than the Pixel Ping direction?)

And does it update the node_counter table? That's the key. I should be able to construct a view of articles, for instance, that sorts by popularity. That requires a complete set of node_counter data.

I'd answer these questions myself, but I'd rather see whether anyone's actually used this module for what we're discussing in this thread.

Prevent Forged Hits

mikeytown2's picture

One interesting thing you could try is to prevent forged hits to the node_counter table via the access log's session variable. If you're only writing to the database once a minute, you could use the session as the hash key for that node's counter. Example data array:

nodes
  nid-1
    SESS-A
    SESS-B
    SESS-C
  nid-2
    SESS-A
    SESS-C

This way if SESS-C hits the same node 100 times a minute it will only count as 1 view.
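The data array above could be sketched like this. Both function names are hypothetical; the idea is just that keying each node's hits by session ID makes repeat hits from one session collapse into a single entry.

```php
<?php
// Hypothetical helpers illustrating per-session de-duplication.

// Record a hit; the session ID is the array key, so repeats overwrite.
function example_record_hit(array &$nodes, $nid, $session_id) {
  $nodes[$nid][$session_id] = TRUE;
}

// At flush time, each node counts one view per distinct session.
function example_count_unique_hits(array $nodes) {
  $counts = array();
  foreach ($nodes as $nid => $sessions) {
    $counts[$nid] = count($sessions);
  }
  return $counts;
}

$nodes = array();
example_record_hit($nodes, 1, 'SESS-A');
example_record_hit($nodes, 1, 'SESS-B');
// SESS-C hammers node 1 a hundred times within the same minute.
for ($i = 0; $i < 100; $i++) {
  example_record_hit($nodes, 1, 'SESS-C');
}
example_record_hit($nodes, 2, 'SESS-A');
example_record_hit($nodes, 2, 'SESS-C');

// Node 1 counts 3 views and node 2 counts 2, despite the 100 repeats.
print_r(example_count_unique_hits($nodes));
```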

Sessions

mshmsh5000's picture

This would work if session info were being carried through from the pixel request back to Drupal, but Pixel Ping is serving the pixel, and has no concept of session (one reason it's so lightweight). And if this is running on Pressflow, there's no default session for anonymous users anyway, so you couldn't append that info to the pixel request and have it passed back to Drupal during the data flush. Drupal receives the once-per-x-seconds flush from Pixel Ping with only node IDs and hit counts.

I'm going to put up the version I have now, which is simple and works great with Pixel Ping. The next stage will be to implement your suggestion of a Drupal-served pixel as a module option, where we may incur per-page DB writes but also may have more options in terms of, e.g., session tracking (if the session exists -- or, with PF/D7, you could force-start a session).

pseudo session

mikeytown2's picture

Long story short, if the user doesn't have a session I create a pseudo session; it's not the best but it works for 95% of the use cases. And because it's only used for the access log I don't have to worry about stolen sessions etc. This sends nothing back to the user.

<?php
$session_id = md5($_SERVER['HTTP_USER_AGENT'] . ip_address());
?>

boost_stats_init() has this logic if you're looking for it; it's inside boost_stats.php.

Think both solutions are good

bennos's picture

I think both solutions are good ideas.
Pixel Ping is a small and great way to track content hits outside the page.
In cases where content is distributed via RSS or under a CC license, the Pixel Ping tracker is a great way to track impressions all over the world.

I also love the idea of using the Analytics API, as mentioned above.

I would test both modules on different setups.

First version available on GitHub

mshmsh5000's picture

There's a lot left to do, including a lot of the good suggestions from this conversation, but the initial module that handles Pixel Ping integration is here:

http://github.com/mshmsh5000/pixel_pusher

Let me know if anyone tries it out. I should be able to do some thorough load testing in the coming weeks.

Tryin it out

ccshannon's picture

Hey Matt,

Thanks for doing this.

I've got this running on my laptop. I'm not 100% clear on exactly what node.js and pixel ping truly "do", but I have the pixel ping running in a Terminal window, flushing every 5 seconds, receiving calls from a local Drupal instance with one node in it.

Looks like this.

1:    node,add
--- flushed ---
--- flushed ---
--- flushed ---
1: node,add
--- flushed ---
--- flushed ---
--- flushed ---
1: node,1
--- flushed ---
--- flushed ---

I understand that it doesn't record to the node_counter, yet, but is there any current relevance to the 'endpoint' setting in the config.json?

I gave endpoint a URL based on my MAMP and VirtualHost settings, so instead of 'http://example.com/pixel_pusher/save_hits' I have it set to 'http://dev.lh:8888/pixel_pusher/save_hits' (dev.lh is just my laptop's Doc Root) but of course I get a 404 when visiting that address. Should it instead point to my Drupal instance's subdomain?

FYI, the main reason I need this functionality is that our sites are running behind Varnish caching, and our 'Most Popular' blocks are not working, because the stat counter is not recording all the hits, so really old content (pre-Varnish) appears to have more hits than new content.

Drupal 7 version

fangel's picture

Hi Matt..

I've created a Drupal 7 port of your module.. https://github.com/fangel/pixel_pusher (I've also created a pull-request on GitHub so you can pull it into a d7-branch of your repo)

-Morten

d.o. module?

valthebald's picture

Are you going to publish that on d.o.?

Point to your Drupal instance

mshmsh5000's picture

Hi ccshannon, and sorry for the delay. To make the node_counter integration happen, you need to tell Pixel Ping to flush its data to your Drupal instance. And you need to enable the Pixel Pusher module on your Drupal instance, so it provides that path as a valid callback.

If, e.g., your Drupal site is at mysite.dev.lh, enable Pixel Pusher on that site, and then set the right URL in Pixel Ping's config.json:

"endpoint": "http://mysite.dev.lh/pixel_pusher/save_hits"

You'll also need to make sure this path doesn't get cached in Varnish -- I haven't implemented a test with Varnish, but I bet you just want to exclude this path in the .vcl.

At this point, you should be able to simulate traffic to your local site and see the node_counter table getting updated after every Pixel Ping data flush.

Let me know how it goes.

APC version

drupal4media's picture

To eliminate the Pixel Ping dependency, how about creating a version that temporarily stores the data array in APC? It's very common to use APC in high traffic sites and it would avoid the one DB hit per node view problem in Boost.

Esteban

Patch for Boost

mikeytown2's picture

If you want to create a fully working patch for boost I will commit it.

Something like this?

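<?php
// Count a node view in APC instead of writing to the database.
if (function_exists('apc_fetch')) {
  $key = 'boost_node_counter: ' . $nid;
  $count = apc_fetch($key);
  if ($count) {
    $count++;
  }
  else {
    // First hit for this node (assignment, not comparison).
    $count = 1;
  }
  // apc_store() overwrites an existing entry; apc_add() would not.
  apc_store($key, $count);
}
?>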

The code above isn't atomic, and it doesn't work across multiple servers. Memcache would be a better option; we'd need locking too if we can't figure out how to make it atomic.

I will create the patch for

drupal4media's picture

I will create the patch for Boost.

Actually there shouldn't be any problem with multiple servers because the idea is to store a data array as you described above and flush it to the database with cron:

nodes
  nid-1
    SESS-A
    SESS-B
    SESS-C
  nid-2
    SESS-A
    SESS-C

In fact, I think a local store is better than Memcache because we need a lightweight locking mechanism (it can't be made atomic). I was thinking of semaphores for locking, and we could even use shared memory instead of APC. Using APC might simplify monitoring if it's available, but shm seems to be a good fallback.

The thing is I'm trying to replicate the Java solution for this problem (Singleton with cron) but I'm not sure if it feels natural in the LAMP world.

Cheers,

Esteban

I prefer to use a temporary table

jcisio's picture

I prefer to use a temporary table, like node_counter_temp, with a simple structure (nid, timestamp). Every hit adds a row to this table, and a cron job cleans it up and updates node_counter. An INSERT is much faster than an UPDATE, and there is no need to worry about atomicity.

This table could be replaced by APC/memcache locally and a cron updates node_counter periodically. But I don't know if this method is doable, as we can't fetch all "rows" in APC/memcache.
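A sketch of the cron-side step for the temporary-table variant, under the assumption that the rows of node_counter_temp have already been fetched as (nid, timestamp) pairs. The function name is hypothetical; it does in PHP what a GROUP BY nid with COUNT(*) and MAX(timestamp) would do in SQL.

```php
<?php
// Hypothetical helper: collapse one-row-per-hit records into per-node
// totals, ready to be applied to node_counter in a single pass.
function example_aggregate_hit_rows(array $rows) {
  $totals = array();
  foreach ($rows as $row) {
    $nid = (int) $row['nid'];
    if (!isset($totals[$nid])) {
      $totals[$nid] = array('count' => 0, 'timestamp' => 0);
    }
    $totals[$nid]['count']++;
    // Keep the most recent hit time for the node.
    $totals[$nid]['timestamp'] = max($totals[$nid]['timestamp'], (int) $row['timestamp']);
  }
  return $totals;
}

$rows = array(
  array('nid' => 1, 'timestamp' => 100),
  array('nid' => 2, 'timestamp' => 105),
  array('nid' => 1, 'timestamp' => 110),
);
// Node 1: 2 hits, last seen at 110. Node 2: 1 hit, last seen at 105.
print_r(example_aggregate_hit_rows($rows));
```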

After a little more research,

jcisio's picture

After a little more research, memcache can use a token (CAS) for atomicity: http://stackoverflow.com/questions/3300166/perform-atomic-array-modifica.... With that, we can store an array keyed by nid => hits, or separate values keyed by nid.

Some working code

drupal4media's picture

I have put together some code that seems to be working and does the following:

  • Exposes a new callback through hook_menu that stores the node hit in shared memory. This should be replaced by a custom script (as in Boost) in order to avoid the Drupal bootstrap that is not required at all. It was done in a callback only for easier testing.
  • Appends an image tag to $closure pointing to this callback in hook_preprocess_page only for nodes.
  • Flushes the data stored in shm to the database in hook_cron executing a multi_query to avoid many trips to the server.

It would only work on Unix systems with System V IPC support because of the shared memory stuff and the synchronization being done with a semaphore. It also requires the mysqli interface due to the multi query executed in cron. It seems demanding but this module would be on the extreme side and I think this setup is pretty usual for those cases. What do you think? I'm not sure about the feasibility of including code with these dependencies in Boost...

Please let me know if you'd like to see some code or continue discussing its usefulness and implementation options.

Cheers,

Esteban

Not sure what this solves

mshmsh5000's picture

The point of this module and its reliance on Pixel Ping/node.js is to take the load entirely off the LAMP stack for high-traffic sites. Your proposed changes seem both totally unrelated to the original module and intent on solving a very different problem.

The node.js implementation is meant for sites handling very high traffic -- hundreds of thousands or millions of page views per hour. My fear in implementing any Drupal-side pixel-serving solution was that relying on the LAMP stack in these circumstances, even with a beefy pool of web heads, would lead to face-melting disaster.

So, follow-up questions:

  1. Are we talking about the same universe of traffic profile?
  2. If so, what's so different about our infrastructures that makes you more confident in your LAMP stack for handling this, rather than an incredibly lightweight process like node.js?

db connection

mikeytown2's picture

I would take a guess that the DB connection is one of the slowest parts of registering a hit via boost_stats.php so if you can reduce that down to a once a minute occurrence that would speed things up dramatically. A simple "hello world" php script is not too far behind a static file according to this blog post: http://blog.a2o.si/2009/06/24/apache-mod_php-compared-to-nginx-php-fpm/

Speaking of logs, what about writing to a local file if shared memory is too complicated? That's what the Apache log file is, in short, right?
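A minimal sketch of the local-file idea, assuming one nid per line in a plain text file. The function names and file layout are hypothetical. Hits are appended under an exclusive lock; a periodic job renames the file out of the way, then counts the lines.

```php
<?php
// Hypothetical helpers for logging hits to a local file.

// Append one hit. LOCK_EX keeps concurrent appends from interleaving.
function example_log_hit($file, $nid) {
  return file_put_contents($file, (int) $nid . "\n", FILE_APPEND | LOCK_EX) !== FALSE;
}

// Collect and clear the logged hits, returning nid => hit count.
function example_collect_hits($file) {
  if (!is_file($file)) {
    return array();
  }
  // Rename first so new hits go to a fresh file while this one is read.
  $batch = $file . '.' . uniqid('', TRUE);
  if (!rename($file, $batch)) {
    return array();
  }
  $lines = file($batch, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
  unlink($batch);
  if ($lines === FALSE) {
    return array();
  }
  return array_count_values(array_map('intval', $lines));
}

$file = sys_get_temp_dir() . '/example_hits_' . uniqid() . '.log';
example_log_hit($file, 5);
example_log_hit($file, 5);
example_log_hit($file, 9);
// Node 5 was hit twice and node 9 once.
print_r(example_collect_hits($file));
```

This is only a sketch: a hit written at the exact moment of the rename could still be missed, which may or may not matter for popularity counts.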

Memory as well

mshmsh5000's picture

Part of what node.js does very well is to run at a high capacity with low memory usage.

http://nodejs.org/#about

Even "Hello world" requires a separate thread.

I still take your point, though, which is the one you made after my original post: it would be good to decouple the module from Node.js and allow for a pluggable layer. The problem on the implementation side is that Pixel Ping isn't just a storage layer; it's also the facility for posting data back to the Drupal module. So it's possible that a more direct path would just be a totally separate module.

I understand your point

drupal4media's picture

I understand your point. The discussion started with the pluggable layer concept and it finally diverged far from there =)

As you mentioned, the Pixel Ping approach won't be easy to abstract in a pluggable way, so it would make sense for it to be a separate module. I like the Pixel Ping approach a lot and would only suggest using mysqli multi_query to flush the hits with INSERT ... ON DUPLICATE KEY UPDATE...
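For illustration, the statements for such a flush could be built along these lines. The function name is hypothetical, the column list is that of core's node_counter table (nid, totalcount, daycount, timestamp), and the sketch assumes no table prefix. All values are cast to integers by the %d placeholders, so nothing from the request ends up in the SQL as a string.

```php
<?php
// Hypothetical helper: turn nid => hit count into one string of
// INSERT ... ON DUPLICATE KEY UPDATE statements, suitable for passing
// to mysqli's multi_query() in a single round trip.
function example_node_counter_sql(array $hits, $timestamp) {
  $statements = array();
  foreach ($hits as $nid => $count) {
    $statements[] = sprintf(
      'INSERT INTO node_counter (nid, totalcount, daycount, timestamp) VALUES (%d, %d, %d, %d) '
      . 'ON DUPLICATE KEY UPDATE totalcount = totalcount + %d, daycount = daycount + %d, timestamp = %d',
      $nid, $count, $count, $timestamp, $count, $count, $timestamp
    );
  }
  return implode('; ', $statements);
}

// Two nodes produce two statements separated by "; ".
echo example_node_counter_sql(array(1 => 3, 7 => 1), 1300000000), "\n";
```

A node seen for the first time gets a new row; an existing row has its counters incremented by the batch total.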

On the other hand, perhaps we can continue the discussion on a different module with a pluggable layer to complement Boost in a different place :)

I put my code here:

https://github.com/drupal4media/HPStats

Cheers,

Esteban

IPC? Nice touch.

xaris.tsimpouris's picture

IPC? Nice touch.
Apparently, if we put _counbter and _image in a different file, we can avoid even bootstrapping Drupal for just a pixel, as the nid is provided through a URL parameter.
Right?

--
Xaris
yet another drupal developper
http://1204.gr