How to manage the DB and multiple subsites for a newspaper website with a million nodes?

jscm

Hi,
I have to choose how to manage the sections of a large newspaper website.

I could use a single DB to manage everything, but then two tables (the node tables and the search index) would hold a huge amount of data, and optimizing the search-index queries would become a serious problem.

Alternatively, I could manage the site with one DB per section, i.e. a separate website for each section (sport, business, leisure, etc.). That solution requires a top-level domain plus several subdomains, but it brings other problems: it is difficult to build a front site that aggregates information from all the subsites, users would have to register on every single website, and so on.

I have also found a serious bug in Drupal 6's search-index queries when a site has a million nodes.

Can you help me understand what I should do?

I want a site like these:
http://www.espn.go.com/
http://www.reuters.com/
http://www.businessweek.com/
http://www.economist.com/

These websites offer sub-sites in several languages (English, Italian, Spanish), as Reuters and BusinessWeek do.
Every website will have several sections, but I don't know how to manage it all.

Which modules does The Economist website use?

Sorry for my bad English; I'm Italian.

Best regards

G.Aloe
Milan (Italy)

Comments

Quick tip

mikeytown2

The quickest way to get where you want to go is to use Mercury:
http://groups.drupal.org/node/70268
https://github.com/pantheon-systems/mercury

If you're pushing your box to the limit, I would use
Percona instead of MySQL, or MySQL 5.5
http://www.percona.com/downloads/Percona-Server-5.1/LATEST/
nginx instead of Apache
http://groups.drupal.org/nginx

Good luck

One database, several servers

yelvington

Don't cut your database up into separate pieces -- that's an integration and data management nightmare.

Mikeytown2 is right about Mercury -- since most of your traffic is not logged in, a reverse proxy cache makes sense for news sites.

I don't know what you mean about a search "bug" in Drupal 6 -- can you link to an issue that's been filed for such a bug?

Hi yelvington, you suggest

jscm

Hi yelvington,
You suggest not cutting the DB into separate pieces because managing the data would be a nightmare. But if I keep everything in one DB, the node_revision and search_index tables will become very large with a million nodes, and with that amount of data I will have real problems managing backups and database imports.

Follow these links to understand the Drupal 6 bug for sites with a large number of nodes:
http://drupal.org/node/312395
http://wtanaka.com/drupal/million-nodes-6
http://drupal.org/node/312393

When you speak about several servers, do you mean synchronizing the DBs and files across separate machines?

What do you think about building a news website as a multisite with multiple DBs, one site per nation?

Search vs core functionality

yelvington

Those issues aren't about Drupal in general -- they specifically refer to search.module, which implements a rudimentary search engine on top of MySQL. It is known to have scalability issues.

Most large sites simply don't use it. We use FAST. A lot of sites (including drupal.org) use Lucene/Solr, which of course is available from Acquia. I'm not sure what's in the Pantheon stack but I think it may use Solr, too.
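To make the point concrete, here is a minimal sketch of what "offloading search to Solr" looks like from the application side. The core URL, field names, and parameters are illustrative assumptions, not The Economist's or Pantheon's actual configuration; the idea is simply that a keyword search becomes an HTTP request to the Solr index instead of heavy SQL against the search_index table.

```python
from urllib.parse import urlencode

def build_solr_query(base_url, keywords, rows=10, start=0):
    """Build a Solr /select URL for a keyword search.

    With search offloaded, queries like this hit the Solr index over
    HTTP instead of running JOIN-heavy SQL against MySQL's
    search_index table.  base_url is a hypothetical core URL.
    """
    params = {
        "q": keywords,    # the user's search terms
        "rows": rows,     # page size
        "start": start,   # offset for paging
        "wt": "json",     # response format
    }
    return "%s/select?%s" % (base_url.rstrip("/"), urlencode(params))

url = build_solr_query("http://localhost:8983/solr/drupal", "economy italy", rows=20)
print(url)
```

From there, fetching and decoding the JSON response is a plain HTTP GET; MySQL never sees the search at all.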

Our issues with Drupal have had to do with performance under peak traffic loads, not the size of the data set. Our approach has been to have several tiers:

  • Multiple reverse proxy cache servers (we use Squid) with round-robin DNS. These serve anything that can be cached -- anonymous pages, CSS, Javascript, images, et cetera -- and remove all that load from the Drupal server. If they don't have an object, they request from the next layer.

  • Load-sensitive balancer. If a Drupal box goes down, the site stays up.

  • Multiple application servers. Filesystems are shared. We actually run multiple newspapers on each cluster, which helps by averaging out system load. All the cache tables are moved from the database into Memcache, which is installed directly on each app server to avoid needless network traffic.

  • Separate MySQL database server with all the tables converted to InnoDB.
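The tiers above form a cache fall-through: each request is answered by the shallowest layer that holds the object, so the expensive layers only ever see misses. A toy sketch (the layer names and contents are made up for illustration; dicts stand in for Squid, Memcache, and MySQL):

```python
def lookup(path, layers):
    """Walk the tiers in order; the first layer holding the object
    answers, so deeper (more expensive) layers only see misses."""
    for name, store in layers:
        if path in store:
            return name, store[path]
    return None, None

layers = [
    ("squid",    {"/css/site.css": "cached css"}),    # reverse proxy cache
    ("memcache", {"/node/42": "rendered node 42"}),   # app-server cache tables
    ("mysql",    {"/node/42": "node 42 from db",
                  "/node/7": "node 7 from db"}),      # database of record
]

print(lookup("/css/site.css", layers))  # served by the proxy
print(lookup("/node/42", layers))       # served by memcache, never reaches MySQL
print(lookup("/node/7", layers))        # only the DB has it
```

Note how `/node/42` exists in both Memcache and MySQL but MySQL is never consulted; that is the whole load-shedding effect of the stack.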

On that last item, we'd like to use MySQL replication but it's not as simple as it looks. I understand this gets much easier with Drupal 7's PDO layer, which is smart enough to split reads from writes, sending the latter to the master.
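The read/write split amounts to a query router: writes must reach the master, while reads can be spread over replicas. This is only a sketch of the concept; a real database layer (including Drupal 7's) routes by an explicit query target chosen by the caller, not by sniffing the SQL text as this toy does:

```python
import itertools

class QueryRouter:
    """Send writes to the master, round-robin reads over replicas.
    Toy sketch of read/write splitting; server names are invented."""

    def __init__(self, master, replicas):
        self.master = master
        self._reads = itertools.cycle(replicas)

    def route(self, sql):
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._reads)   # reads can go to any replica
        return self.master             # INSERT/UPDATE/DELETE hit the master

r = QueryRouter("db-master", ["db-replica1", "db-replica2"])
print(r.route("SELECT nid FROM node"))        # db-replica1
print(r.route("UPDATE node SET status = 1"))  # db-master
print(r.route("SELECT * FROM users"))         # db-replica2
```

The hard part replication adds in practice is not the routing but replication lag: a read sent to a replica immediately after a write may not yet see that write.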

On one of our sites, we initially loaded every bit of data we'd collected since the early 1990s. When we were struggling with some performance issues we backed out and moved several years of very old data to a separate system.

You could think about doing something like that, but you need to plan ahead (hint: write smart Pathauto rules so your cache layer knows where to send the traffic).

Since archival data doesn't change (at least not very often) it doesn't need to be in the same editorial environment. However, this does make it more difficult to integrate historical data with current news, as in topics pages.
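The Pathauto hint above boils down to making the archive boundary visible in the URL itself, so the proxy layer can route without asking either backend. A sketch, assuming a hypothetical alias pattern that embeds the publication year (the pattern and cutoff are invented for illustration):

```python
import re

ARCHIVE_CUTOFF_YEAR = 2005  # everything older lives on the archive system

def backend_for(path):
    """Route by the year embedded in the path alias, e.g.
    /news/2003/old-story.  Because the year is in the URL, a proxy
    or rewrite rule can make this decision with no DB lookup."""
    m = re.match(r"/news/(\d{4})/", path)
    if m and int(m.group(1)) < ARCHIVE_CUTOFF_YEAR:
        return "archive"
    return "drupal"

print(backend_for("/news/1997/flood-coverage"))  # archive
print(backend_for("/news/2010/election-night"))  # drupal
```

The same rule works as a Squid/Varnish routing condition or a rewrite rule; the key design choice is deciding the URL scheme before you split the data, not after.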

Performance optimization is a tricky subject and you can't optimize for everything at once. You might want to follow this g.d.o group: http://groups.drupal.org/high-performance

Servers and Load

jscm

Hi Yelvington,
I'm studying your suggestions step by step, using information I found on the web.

You say:

We use FAST

But what is FAST? I didn't find any information about it. I have found more information about Apache Solr.

What tools do you use to manage a load-sensitive balancer on Linux servers?

Is it really necessary to put the proxy servers behind round-robin DNS?

I have to study MySQL replication, but if I keep MySQL on the same partition as the Drupal files, it will be replicated by the cluster. Do you suggest keeping the MySQL DBs on another server partition, or on a different server?

Thank you, Yelvington.

Some answers

yelvington

FAST: http://en.wikipedia.org/wiki/Fast_Search_%26_Transfer
They were acquired by Microsoft after we chose them. If we were making the decision today we might be inclined to go with Solr instead, but we're not in a hurry to switch technologies, as switching is expensive.

I'm not sure what we're using for load balancing; different department. But it's probably something from Cisco.

The reverse proxy server is probably the most important part of the puzzle. It provides almost complete protection against being Slashdotted/Farked/etc, because most anonymous traffic never makes it to the app layer. As I mentioned, we use Squid, because we've been using Squid for years, but Varnish is the new hotness. http://www.varnish-cache.org/
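The reason the proxy absorbs a traffic spike is that its cacheability rule is purely request-based: anonymous GETs can be served from cache, while anything carrying a session cookie must pass through. A simplified sketch of the rule a Squid or Varnish config typically encodes (the `SESS` cookie prefix is Drupal's default session-cookie naming; everything else here is illustrative):

```python
def cacheable(method, cookies):
    """Decide whether the reverse proxy may answer from cache.
    Anonymous GETs are cacheable; a Drupal session cookie (name
    starting with 'SESS') marks a logged-in user, whose request
    must reach the app layer.  Simplified sketch only."""
    if method != "GET":
        return False  # never serve writes from cache
    return not any(name.startswith("SESS") for name in cookies)

print(cacheable("GET", {}))                   # True: anonymous hit, proxy answers
print(cacheable("GET", {"SESSabc123": "x"}))  # False: logged in, pass through
print(cacheable("POST", {}))                  # False: POST always passes through
```

Because the vast majority of news-site traffic is anonymous GETs, nearly everything falls into the first case, which is why the app servers barely notice a front-page link from a high-traffic aggregator.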

We have kept our database servers separate from our application servers since the 1990s. This makes it much easier to figure out where bottlenecks are slowing us down. You may well discover that you never need a replicated database with multiple DB servers.

Newspapers on Drupal
