High traffic volume sites with high accuracy statistics

xaris.tsimpouris's picture

Hi,

The problem:
For every post, I want to show the exact number of times it has been viewed by everybody, anonymous or not.
The count should appear not only on each post's own page, but also in views that list several posts.

Not a problem:
I don't mind if the numbers are refreshed only every now and then, for example every 30 minutes.
However, the numbers must be accurate.

As it is now (everything through Drupal)
This is how it is already implemented, and it is why we want to change it: on a high-traffic site, counting every view inside Drupal causes serious performance issues for many reasons.
Pros: Full Drupal system
Cons: If we put Varnish in front (or anything equivalent that serves pages without bootstrapping Drupal), anonymous hits are not counted.

I believe the best way is to use Varnish, or anything equivalent, and still save all hit counts somehow.
But how?
To use Varnish without losing hits, either:
A) Varnish has to log everything (can it do so?), or
B) something else has to save the hit counts,

...and then, once in a while, submit this information to Drupal,
...or serve this information from a different source through JavaScript and show it to the user, without Drupal having to do anything at all (everything stays static as far as Drupal is concerned).

So:
Idea A. A user hits a page and gets the response, even from Varnish. Asynchronously, JavaScript sends a request (R1) for each statistic, gets a number back, and shows it in a specific div.

R1 could be:
R1-1. Google Analytics: through the JavaScript Google Analytics API, we request the view statistics for a specific page.
Pros: Google
Pros: Our site is fully cacheable, no hit is lost
Cons: Google may complain about high request volume (501/503 errors); there are specific limits for the API.
Cons: We have to expose an API key inside the JavaScript

R1-2. Something else on our own server, with custom PHP and MySQL queries that log everything? Could this be a good way to avoid bootstrapping Drupal and everything else? With this approach we would have to make an asynchronous call from every page to our own logging system, just as the Google Analytics JavaScript does.
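To make the "own logging system" option more concrete, here is a minimal sketch of the storage side only. It is written in Python with SQLite standing in for the custom PHP and MySQL mentioned above; the table name `hits` and the functions `open_store`, `record_hit` and `get_count` are hypothetical names chosen for illustration, not part of Drupal or any module.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the hit store. SQLite stands in for the MySQL table (assumption)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits ("
        "nid INTEGER PRIMARY KEY, views INTEGER NOT NULL)"
    )
    return conn


def record_hit(conn, nid):
    """Count one view of node `nid`; meant to be called by the async request."""
    with conn:  # both statements are committed as one transaction
        conn.execute(
            "INSERT OR IGNORE INTO hits (nid, views) VALUES (?, 0)", (nid,)
        )
        conn.execute("UPDATE hits SET views = views + 1 WHERE nid = ?", (nid,))


def get_count(conn, nid):
    """Return the stored view count, or 0 for a node that was never viewed."""
    row = conn.execute(
        "SELECT views FROM hits WHERE nid = ?", (nid,)
    ).fetchone()
    return row[0] if row else 0
```

The point of the sketch is that the counter needs only one tiny table and two queries per hit, so the endpoint that receives the asynchronous call never has to bootstrap Drupal.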

Idea B. A user hits a page and gets the response, even from Varnish. Once in a while (cron job), Drupal is updated with the new view counts. The value could be stored in a CCK field, to avoid another JOIN. But how is this CCK field refreshed?
1. A cron job checks the Apache/Varnish logs and refreshes the CCK value (hardcoded [SQL queries] or the Drupal way [load nid -> save])
2. A cron job requests statistics for many pages (simultaneously) through the Google Analytics API and refreshes the CCK values the same way as in B.1
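As a rough illustration of option B.1, the log-parsing half of such a cron job could look like the sketch below. It assumes Apache-style access log lines and un-aliased paths of the form `/node/<nid>`; the function name `count_node_views` is hypothetical, and a site using path aliases would first need to map each alias back to its nid.

```python
import re
from collections import Counter

# Matches a GET for /node/<nid> (optionally with a query string) and captures
# the nid and the HTTP status. Sub-paths such as /node/12/edit do not match.
NODE_VIEW = re.compile(r'"GET /node/(\d+)(?:\?[^ "]*)? HTTP/[0-9.]+" (\d{3})\b')


def count_node_views(lines):
    """Return a Counter mapping nid -> number of successful (200) views."""
    views = Counter()
    for line in lines:
        match = NODE_VIEW.search(line)
        if match and match.group(2) == "200":
            views[int(match.group(1))] += 1
    return views


sample_log = [
    '1.2.3.4 - - [10/Oct/2011:13:55:36 +0000] "GET /node/12 HTTP/1.1" 200 2326',
    '1.2.3.5 - - [10/Oct/2011:13:55:40 +0000] "GET /node/12 HTTP/1.1" 200 2326',
    '1.2.3.6 - - [10/Oct/2011:13:56:02 +0000] "GET /node/12/edit HTTP/1.1" 200 900',
    '1.2.3.7 - - [10/Oct/2011:13:56:10 +0000] "GET /node/7 HTTP/1.1" 404 120',
    '1.2.3.8 - - [10/Oct/2011:13:56:15 +0000] "GET /node/7?page=1 HTTP/1.1" 200 512',
]
views = count_node_views(sample_log)
```

The resulting per-node totals would then be added to the stored values in one batch, either with direct SQL or through the Drupal API, as described in B.1.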

Has anybody thought all of this through? Has he or she found the best solution?

Comments

Google Analytics API

joshk's picture

Cron job, through Google Analytics API, that requests statistics for many pages (simultaneously)

I would recommend this option. Building your own analytics system is a lot of work, and you should be able to get this information from Google pretty easily. I would probably not do it via drupal directly, but rather have a separate simple system that can request the statistics.

I would also not use CCK as your storage mechanism if you're concerned about performance. Take control and index this data.

Solid points, just one

greggles's picture

Solid points. Just one suggestion, in case the original poster does want to do it via Drupal:

http://drupal.org/project/ga_importer

That pulls GA data into Drupal.

smaller problems

xaris.tsimpouris's picture

Problem A.
Google Analytics exports data on a daily basis. So it may be an accurate way, but it is extremely slow.
I can't wait a whole day just to see a number :(

@greggles: Nice module, but it works through the cron hook for all nodes, and only once per node per day

Problem B.
Why not a CCK field? One field, one value per node is quick enough.

--
Xaris
yet another drupal developper
http://1204.gr

You could do it with Varnish

vegardx's picture

You could do it with Varnish and ESI; otherwise I have no idea how you'd be able to go forward without seriously hammering your database, as you more or less cannot use any form of external caching. Also, you might want to consider memcached. Why does every user need absolute numbers? What about pseudo-dynamic numbers: update them every X seconds. They will look dynamic to the users, but you mitigate more or less every hit, as you only make one hit every X seconds.
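The "update every X seconds" idea can be sketched as a small time-based cache. This is only an illustration: the class name `CachedCounts` is hypothetical, an in-process dictionary stands in for memcached, and `fetch` is whatever function actually reads the count from the database.

```python
import time


class CachedCounts:
    """Serve view counts from a cache, hitting the backend at most once per TTL."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch      # callable: nid -> current count (the slow path)
        self._ttl = ttl_seconds  # the "X seconds" from the comment above
        self._clock = clock      # injectable so the behaviour can be tested
        self._cache = {}         # nid -> (expires_at, value)

    def get(self, nid):
        now = self._clock()
        entry = self._cache.get(nid)
        if entry is not None and now < entry[0]:
            return entry[1]      # still fresh: no backend hit
        value = self._fetch(nid)  # expired or missing: one backend hit
        self._cache[nid] = (now + self._ttl, value)
        return value
```

With this shape, a node that is requested thousands of times within one TTL window causes a single database read, at the cost of the displayed number being up to X seconds old.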

--
Vegard


For a lightweight pageview

Vacilando's picture

For a lightweight pageview counter, check out Google Analytics Counter. It reliably and efficiently fetches all data at cron runs, so there is no burden on the user.

And it's pretty scalable. Depending on the size of your site and on how often you want pageview counts updated, even with the current Google Analytics API quota limitations you can theoretically fetch counts for 10,000 unique paths every 8.64 seconds -- or, at the opposite extreme of the scale, on a site with 100,000,000 pages you can still provide a counter update for each of these pages once a day.
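The two figures above follow from simple arithmetic, worked through below. The quota values are assumptions inferred from the numbers in this comment (10,000 requests per day, each returning up to 10,000 paths); check the current Google Analytics API limits before relying on them.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
REQUESTS_PER_DAY = 10_000        # assumed daily API request quota
PATHS_PER_REQUEST = 10_000       # assumed maximum rows returned per request


def seconds_between_requests(requests_per_day=REQUESTS_PER_DAY):
    """Spacing between requests if the daily quota is spread evenly."""
    return SECONDS_PER_DAY / requests_per_day


def paths_per_day(requests_per_day=REQUESTS_PER_DAY,
                  paths_per_request=PATHS_PER_REQUEST):
    """Upper bound on distinct paths whose counts can be fetched in one day."""
    return requests_per_day * paths_per_request


interval = seconds_between_requests()  # 86,400 / 10,000 = 8.64 seconds
daily_capacity = paths_per_day()       # 10,000 * 10,000 = 100,000,000 paths
```

So a site with up to 10,000 paths could in theory refresh all of them every 8.64 seconds, while a site with 100,000,000 paths would use the whole daily quota for a single refresh.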


---
Tomáš J. Fülöpp
http://twitter.com/vacilandois

Varnish?

Kristen Pol's picture

How can this work when Varnish is installed? GA won't be hit on cached pages, right?

It will be hit, because the

aries's picture

It will be hit, because the meters are not on your site.

Aries

Ah...

Kristen Pol's picture

Yes, of course. Hadn't had my coffee yet ;)

So, this seems like the easiest way to get page counts on a high-traffic, Varnish-enabled website to me. I'll keep it in mind.

Thanks!

Maybe cache control module

vabue's picture

Maybe the Cache Control module would help you?

jstats

mikeytown2's picture

Have you looked at http://drupal.org/project/jstats in combination with something like ESI http://drupal.org/project/esi ?

If you need this

Jānis Bebrītis's picture

If you need this functionality to display the hottest content, take a look at the Radioactivity module - http://drupal.org/project/radioactivity