Best architecture for news site?

Posted by mburak on November 8, 2009 at 7:12pm

Hey, I'm new to this group and very happy to be joining you.
In my company, we're migrating our news sites from an old Java CMS that we built some years ago to Drupal.
Our major concern is Drupal's performance, so we're starting to work on an architecture for it. With our old CMS we generated static pages, so Apache's only job was to serve those HTML files. I don't know if there is a way for Drupal to do this while people are logged in to the back end creating nodes. What would be a good architecture for this? We have about 50k unique users a day and growing.
Thanks for your thoughts!

Matias.

Comments

Doable

Posted by kbahey on November 8, 2009 at 7:30pm

It all depends on how you architect the site and what the users do on the site. If they are posting comments fast, or posting a lot of nodes, it adds more load than a site that is mostly read only.

And 50,000 visits a day is not that high. We manage sites that do over 300,000 visits a day, and over 1.3 million page views.

For static caching, you can install the boost module, then crawl the site using httrack or something similar. Here is more info.

Drupal performance tuning, development, customization and consulting: 2bits.com, Inc.
Personal blog: Baheyeldin.com.

Boost

Posted by mikeytown2 on November 9, 2009 at 1:17am

I second the recommendation to use Boost; you might be able to get by with 1 server if you have everything cached in Boost. The module has a built-in multi-process crawler that will hit every URL in your url_alias table. In short, Boost tries to do as much as it can on its own so you never have to worry about the cache again. Because of its power, it can take some time to set up, but once it's running Boost works like magic. The good (and bad) thing about having everything in a static HTML cache is that you can restore the cache directory if something bad happens to it. I had a situation where restoring the cache dir saved our butts: the cache dir got flushed by accident, and with the cache gone we crashed almost instantly due to the load. You can set up Boost so this will almost never happen; in this case the server just was not set up like that.
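To make the crawler idea concrete, here is a rough sketch of warming a static cache by requesting every aliased path. It is not Boost's own crawler: it uses a thread pool rather than multiple processes, and the names `warm_cache` and `fetch_status` are made up for illustration. The list of aliases is assumed to have been read from the url_alias table beforehand.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_status(url: str) -> int:
    """Request a URL and return its HTTP status (0 if the request fails or times out)."""
    try:
        with urlopen(url, timeout=30) as response:
            return response.status
    except HTTPError as err:
        # 4xx/5xx answers still carry a status code we want to record.
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def warm_cache(
    base_url: str,
    aliases: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
    workers: int = 4,
) -> Dict[str, int]:
    """Hit every aliased path once so the static cache gets (re)generated."""
    urls = [base_url.rstrip("/") + "/" + alias.lstrip("/") for alias in aliases]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = list(pool.map(fetch, urls))
    return dict(zip(urls, statuses))
```

Something like `warm_cache("http://example.com", ["news/story-1", "news/story-2"])` would request both pages and return their status codes; keep `workers` low enough that the crawl itself does not overload the server.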

To prevent the above situation from happening, these are the settings I would enable after installing with the defaults. This is what I would call my high-traffic settings:
Cache .xml & /feed
Cache ajax/json
Overwrite the cached file if it already exists.
Expire content in DB, do not flush file.
Enable the cron crawler
Do not flush expired content on cron run, instead recrawl and overwrite it.
Crawl All URL's in the url_alias table.
Set FileETag 'MTime Size'
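The point of "overwrite, do not flush" in the settings above is that an old copy stays available until a fresh one is ready. A minimal sketch of that idea, assuming a plain file cache: write the new page to a temporary file, then swap it into place in one step. The function name `overwrite_cached_page` is hypothetical and this is not how Boost itself is implemented.

```python
import os
import tempfile


def overwrite_cached_page(cache_path: str, html: str) -> None:
    """Replace a cached file in one step so the old copy is served until the new one is complete."""
    directory = os.path.dirname(cache_path) or "."
    os.makedirs(directory, exist_ok=True)
    # Write to a temp file in the same directory so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(html)
        # mkstemp creates the file as 0600; make it readable by the web server.
        os.chmod(tmp_path, 0o644)
        # os.replace overwrites the destination if it already exists.
        os.replace(tmp_path, cache_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Compared with deleting expired files first, there is never a window in which the page is missing from disk and every request falls through to Drupal.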

After applying all of these settings, get the htaccess rules from boost-rules, check the status page, and you should be good to go.
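For anyone wondering how static pages and logged-in editors can coexist, the decision the rewrite rules make can be sketched roughly like this. Everything specific here is an assumption for illustration, not copied from Boost's generated rules: the `cache/normal/<host>/<path>_<query>.html` layout, the `DRUPAL_UID` cookie name used to spot logged-in users, and the function name `static_cache_path`.

```python
import os
from typing import Mapping, Optional


def static_cache_path(
    method: str,
    host: str,
    path: str,
    query: str,
    cookies: Mapping[str, str],
    cache_root: str = "cache/normal",  # assumed layout, check your own rules
) -> Optional[str]:
    """Return the static file to serve, or None if the request should go to Drupal."""
    if method != "GET":
        return None  # form posts and the like always reach Drupal
    if "DRUPAL_UID" in cookies:
        return None  # assumed marker for logged-in users, who bypass the cache
    if ".." in path.split("/"):
        return None  # never let a request escape the cache directory
    candidate = os.path.join(cache_root, host, path.lstrip("/") + "_" + query + ".html")
    if os.path.isfile(candidate) and os.path.getsize(candidate) > 0:
        return candidate
    return None  # not cached yet: Drupal builds the page
```

Anonymous readers get the file straight from disk, while editors creating nodes always hit Drupal, which is the split the original question asks about.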

About re-crawling

Posted by vacilando on November 9, 2009 at 10:18am

@mikeytown2 -- thanks for posting your recommendations for Boost settings. I've checked mine and found two features I did not have enabled (Expire instead of flush, MTime).

One question: while the option to re-crawl and overwrite at cron run seems attractive, it is not clear to me when, in that situation, old and truly useless cache entries would be flushed (e.g. when the cached page does not exist anymore). Or is it the case that if you re-crawl instead of flushing, the cache will slowly grow (entropy!) and one should remember to flush it completely by hand?

---
Tomáš J. Fülöpp
http://twitter.com/vacilandois

404 & 403

Posted by mikeytown2 on November 9, 2009 at 10:54am

You're talking about when a page is either deleted or unpublished, correct? Boost has a mechanism so that if a 404 or 403 is returned, that page will be nuked from the cache and its entry removed from the database. With the next version, 6.x-1.15, due out very soon, it will do that right at the time a node is deleted or unpublished. Good question; this wasn't an easy problem to solve, but it now works correctly when operating this way.
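As a rough illustration of that cleanup rule (not Boost's actual code), a recrawl step could drop both the file and its index entry whenever the page answers 404 or 403. The name `purge_if_gone` is made up, and the plain dict stands in for the database table that tracks cached pages.

```python
import os
from typing import Dict


def purge_if_gone(url: str, status: int, cache_path: str, index: Dict[str, str]) -> bool:
    """Remove a cached page and its index entry when the source returned 404 or 403."""
    if status not in (403, 404):
        return False  # page still exists; leave the cached copy alone
    if os.path.exists(cache_path):
        os.remove(cache_path)
    index.pop(url, None)  # stand-in for deleting the database row
    return True
```

Because stale entries are removed as the crawler meets them, the cache does not keep growing with pages that no longer exist.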

Sounds like just the right

Posted by vacilando on November 9, 2009 at 1:07pm

Sounds like just the right thing - if the source (be it a node, AJAX call, whatever) is unavailable, delete it from cache as well.

However, what about (unlikely/brief) cases when we have to rely on cache -- e.g. if the db is overloaded and does not generate the source (or not in time)? I imagined Boost would serve it from cache. But if Boost assumes the page no longer exists and deletes it from cache as well...?

---
Tomáš J. Fülöpp
http://twitter.com/vacilandois

503

Posted by mikeytown2 on November 9, 2009 at 4:55pm

If the database is not available then Drupal will return a 503 for all requests. The one issue that keeps this from being perfect, at least for now, is when a 301 is given. I need to investigate how global redirect works and go from there. This is only an issue if you rename a page and have the old one auto-redirect to the new one using http://drupal.org/project/pathauto, http://drupal.org/project/path_redirect, http://drupal.org/project/globalredirect.
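Put together, the status handling discussed in this thread amounts to a small policy table. The sketch below is one reading of it, with made-up names (`CacheAction`, `action_for_status`). Treating 301 as "keep" is purely an assumption, since redirects are the open issue.

```python
from enum import Enum


class CacheAction(Enum):
    STORE = "store"  # overwrite the cached copy with the fresh page
    PURGE = "purge"  # delete the cached copy and its database entry
    KEEP = "keep"    # leave whatever is cached untouched


def action_for_status(status: int) -> CacheAction:
    """Decide what a recrawl should do with the cached copy for a given HTTP status."""
    if status == 200:
        return CacheAction.STORE
    if status in (403, 404):
        return CacheAction.PURGE
    # 503 (database down) and anything else: the page may well still exist,
    # so the cached copy stays. 301 lands here too, which is an assumption.
    return CacheAction.KEEP
```

The useful property is that a temporary outage never counts as "page deleted": only an explicit 404 or 403 removes anything.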
