Millions of nodes

Events happening in the community are now at Drupal community events on www.drupal.org.
Wesley Tanaka's picture

A partial investigation of running a Drupal (6) site containing millions of nodes

The (three) problems that I've run into so far are all in search.module -- the "more" core modules seem to do okay.

Comments

I'm curious how Views2 will

moshe weitzman's picture

I'm curious how Views2 will do for you. I find that pager queries are very slow when determining the count.

InnoDB or MyISAM or both?

kbahey's picture

On InnoDB or MyISAM, or both?

COUNT(*) can be slower on InnoDB than on MyISAM. Also, MyISAM can be slow if there are conditions that are not on index columns.

Running EXPLAIN gives some clues, but there is no general catch all solution.

Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.

Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.

The last time I looked at

Wesley Tanaka's picture

The last time I looked at Views was a long time ago. It could have been Views1, I don't recall. At the time, the code contained dynamically generated SQL queries which appeared like they would run without the benefit of proper indexes. I stopped investigating Views at that point. Has that changed at all?

Thats still true, but I'm

moshe weitzman's picture

Thats still true, but I'm not sure it a fair characterization. With Views, you build a SQL query through the GUI. Presumably you put the fields and sorts and conditions that are needed for your task and nothing more. After that, itis your job as a developer to make sure the right indexes are there for your query. Thats true regardless of whether you used Views to build the query or not.

If you have the know-how to

FiReaNGeL's picture

If you have the know-how to build you SQL, do it - to me views is a great module for users who don't have mysql knowledge, or for some specific usage such as generating an archive view.

Even if (after you have

owen barton's picture

Even if (after you have added appropriate indexes) the views query is not fast enough you can actually still use views for display and pass in your own optimized query. This is a great way to get the best of both worlds :)

Very interesting as I'll be

FiReaNGeL's picture

Very interesting as I'll be launching a site next year with millions of nodes at start too. Any other problems than search you encountered? Also, I suggest you look into Sphinx for your search needs, the Drupal one just doesn't compare for non-basic sites.

Thanks for the

Wesley Tanaka's picture

Thanks for the suggestion!

To answer your question, I haven't encountered any other "lack of index/full tablescan/using temporary/using filesort" kinds of problems yet.

Solr Search

shyamala's picture

We are using the Drupal Frame work to develop our new portal. We are avoiding the use of the nodes for the business logic area. (We use drupal's form API, User Management, Form API, all the core features + the community features)Working towards using Solr Search. The Indexing and Search is extremely fast in Solr.

Netlink Technologies Ltd
http://shyamala-drupal.blogspot.com/

One benefit to using nodes

Wesley Tanaka's picture

One benefit to using nodes for everything on a small or medium traffic site is that most third party Drupal modules act upon nodes, so you can sometimes take advantage of functionality in contributed modules "for free" if everything is a node.

Wesley Tanaka
http://wtanaka.com/drupal

Good stuff

damienmckenna's picture

Impressive analysis, Wesley. I've blogged it with the hope to spread the meme, and maybe drive interest to get it rolled into a future release.

Thanks for the info. I am

zeezhao's picture

Thanks for the info.

I am currently working on loading a site with about 1.5 million nodes. Its still in development, but I have encountered the same issues Welsey has discovered with the search index building. Have managed to run cron in batches of 50,000 nodes... but its now creeping to a halt... Still got 500,000 to go... So thinking of alternatives... e.g. Sphinx? But would have preferred search.module to work to keep it simple... [still using drupal 5.7]

The other main issue I had was using node_import to import the nodes. It was using up all memory allocated (500MB) and also very slow (18,000 rows per hour...) See this comment for some info on resolution of memory issue: http://drupal.org/node/309563

Drupal does not scale well

ebruts's picture

I don't get it.

Why doesn't Drupal scale well? And why noone seems to cares?

Recently we have moved our moderate community (50.000 nodes, 600.000 comments) from phpBB to Drupal, just to notice that our server load exploded from < 1.0 up to 20-40!

The main reason for that was the advanced_forum and forum_access module. And we really needed it because the standard Drupal forum is too basic.
We could resolve most of the slow querys in advanced_forum with a custom cache.

The forum_access, however, was a complete fail as it used node_access, resulting in the need of updating every single node once permissions have been changed for a forum.

Why do we have node_access in the core? Whereas a taxonomy based access would be much lighter and faster.

As we could not find any lightweight access control not depending on node_access, we have decided to develope our own, fitting the needs of a forum.
Finally this reduced the load to < 2, making the site accessible.

Now it was time to discover the next problem. The search does not complete withing 40 seconds, resulting in a fatal error timelimit hit.

After some research I found that Drupal.org (hell, i mean it's the site to promote Drupal) does not use their own search engine! Instead they have apachesolr_search.

How come the phpBB search completed in times under 1 second while the Drupal search fails completely?

I am sure Drupal.org uses tons of hacks, far away from the core, to keep up.

Like http://drupal.org/sites/all/modules/memcache/ shows us that the core cache system can't really compete with other systems, resulting in too many querys per page. But then again, why building up a lie and not working against the problem?

I mean I really love Drupal, as a CMS. The Form API, module system and everything is just wonderful to develope nice web applications. And that was the main reason for our move.

But somehow Drupal terribly missed to keep an eye on performance.

Ignoring the rant-y nature of

ceejayoz's picture

Ignoring the rant-y nature of this post...

Things like memcached and Apache Solr require significant system administrator knowledge to install and maintain. They are simply not an option for the vast majority of folks running Drupal.

Drupal's core search is perfectly good for small to mid-size sites. Larger sites tend to have the ability to install something like Solr, or purchase an Acquia subscription for it.

Your main issue seems to not be Drupal, but the fact that you apparently didn't do any research or load testing before making the leap.

You probably missed my main

ebruts's picture

You probably missed my main concern about the search. phpBB just served the search in under a second.
With it's own search engine, no third party installations and no significant system administrator knowledge needed.

Why'd you switch from phpBB,

ceejayoz's picture

Why'd you switch from phpBB, then?

My friend, that was the right

ebruts's picture

My friend, that was the right question.

But it should be "Why would other mid-sized communitys move?"

I did move because I am a real Drupal fanatic. I use it in all of my projects, like I pointed it out in the first reply, I really love the features Drupal offers to build up web applications.

But at the same time, I would love if people would care a bit more about performance. I don't really think that others would keep up with Drupal if they encounter the first problems. They probably will look for alternatives.

Drupal is a tool

skwashd's picture

I did move because I am a real Drupal fanatic. I use it in all of my projects, like I pointed it out in the first reply, I really love the features Drupal offers to build up web applications.

If I was a project manager and you pitched Drupal as you just have in that comment I would run a mile.

Drupal is a tool, lets call it a hammer. It is an excellent tool if your problem is driving in nails. In some cases your problem is driving screws. A hammer will drive screws in if you try hard enough, but it isn't the best tool for the job. In one of my projects we realised having several million nodes each with multiple node references was probably not going to scale as we wanted, so instead we built the application using Zend Framework. If we had have used Drupal our screws would have been trashed.

Testing is a crucial part of any migration, it doesn't matter if it is from one webapp to another or from one hosting platform to another - you have to test it. It seems in this case you did the conversion using as much off the shelf bits of Drupal you could and didn't test it for performance. Sure it will take a bit longer to get it right the first time, but saves you a world of pain - as you have discovered.

As for search, I have never been a big fan of the built in search in Drupal. Even for my small sites I use apache solr. If the client is big enough I run solr on their VPS, or if I am hosting them they use my shared solr server. Again it is tools and tasks. At the same time if a client insists on running a small site on shared hosting, then I will set them up with the default search engine.

Trying to make the problem fit the solution is never a real solution. Use the right tool/s for the job.

But it should be "Why would

ceejayoz's picture

But it should be "Why would other mid-sized communitys move?"

They shouldn't, unless they've got a compelling, well-researched reason to. A community built around a web forum is probably going to be better off staying with vBulletin/phpBB than moving to a more generalist system like Drupal, unless they plan on using more of its features.

But at the same time, I would love if people would care a bit more about performance.

There are plenty in the community caring about performance. The Apache Solr module, the Boost module, the improvements in each Drupal major version's core, the specialist distributions like Pressflow/Project Mercury...

What is drupal?

nestor.mata's picture

Maybe, the first question is what is Drupal, and what is Drupal for?
Drupal is a base infrastructure very configurable that could fit for many, many needs.
But to be as general has a price, every time that you need to get specific you need to work hard.
Drupal it is think for performance, but also for many general porpoises and basic hosting plans.
For this reason you need to see your needs and what you have to work with and make customization for that.
Right now I have Drupal serving more than 1,5Million hits per day and is answering in less than half second for most of the time with an average load of 0.1-0.8.
But this costs a lot of work to get there.
In the other hand why did PHPBB worked fine since just installed, one reason, it is specific for that, the only reason they had in mind so it's a lot easier to make it a fixed solution.
Could Drupal beat that? Yes, and a lot more yes if you need to make real customizations to your site and make your site work with other site or something advanced, but, a customized site, needs a customized solution.
And this can´t be done on live, you need to analyze and plan and test.
So, my resume would be, if you need a Boletin Board only, you can go ahead and use phpbb, if you need something more complicated you can use Drupal or even Drupal with phpBB.
I'm a Drupal fan too, but that doesn't means everything I do I do it in Drupal, I always seek for the best solution for each one of my problems, I don't use a hammer to cut wood, if you know what I mean.

Best regards,
Nestor

Ummm - you seem to be

Jamie Holly's picture

Ummm - you seem to be confusing an issue with a contributed module with Drupal core. I suggest filing an issue with forum_access and maybe offering up some help to incorporate your fixes.

As far as memcache -vs- Drupal's native DB cache, you need to remember that a vast majority of Drupal installations out there are on shared hosting. They don't have memcache on their servers, and in their cases the native DB caching in Drupal provides a significant performance boost. The developers did keep performance in mind by providing a mechanism where the small percentage of sites that have access to and can use other caching systems are able to. That speaks to the versatility and power of Drupal.

And what about Drupal.org using Apache solr? It shows that Drupal again can be extended for different uses and server configurations. Again - not everyone has access to Solr, so they can use the native Drupal search. Or if they don't like that or Solr they can use something else like Sphinx. The point is you will never get the performance or flexibility that these other search engines have in native MySQL. Again - Drupal is mostly installed on shared hosts without these services.


HollyIT - Grab the Netbeans Drupal Development Tool at GitHub.

See my last reply about the

ebruts's picture

See my last reply about the search.

I would help forum_access if I could but there simply is no way out when having too many nodes.

I found this article here describing it pretty well and this is what I did after all. Not using the node_access completely. Again I don't know why we have it in core, it really only works for very small sites, like personal blogs.
http://2bits.com/articles/how-drupals-nodeaccess-table-can-negatively-im...

And yet Drupal's native

Jamie Holly's picture

And yet Drupal's native search works fine for a very vast majority of the Drupal sites out there - some with more than 500,000 nodes - even one site I've done work for with over 750,000 nodes and 4 million comments.

It boils down to the fact that not every site is the same. When you have a system as versatile as Drupal, there is no way you can make it gives the absolute best performance for every scenario possible out of the box. It requires testing and reconfiguring.

The forum module has never been the best, but it fits fine into Drupal given how few sites actually do use it. The power of Drupal is that you aren't stuck with using the forum module as your only forum solution. You can use one of the many other solutions out there, or develop your own. You can even steal the forum module, put it in sites/all/modules/forum and make all kinds of changes to it. Just delete the sites/forum folder out when you update.

If you like Drupal so much (and I'm glad you do), why not help make it better? If you see something that could be improved upon, by all means open up a ticket. It is far more productive than unleashing some rant. It's also the appropriate place for a discussion on how the help improve core features.


HollyIT - Grab the Netbeans Drupal Development Tool at GitHub.

News management site with millions of news

sanmatteo's picture

Hello everybody, this seems the right thread for my question.
I'm about to develop a news management site and i have a csv database with millions of news (about 3 now, and we calculate about 5 millions each year from now on).
Is Drupal a wrong choice?
I can renounce to Search if it's such a problem (is it enough to disable the module?).
I would like to know which other kind of limitations i should expect.
I've read above zeezhao's problem with node import. Could I overcome it using Import/export api module or the import feature of Feeds module?
My main interest is in Calais module, to perform automatic semantic tagging on my news; i know the only limitation here is the one of the Calais API, so it's OK.
Would i face the same problems of the search module, when i for example click on a news tag to see all the news with the same tag?
Does the taxonomy module have the same implications of the search module for very large databases?
I'm strongly interested in Views2 module too, I would use it together with extensions like charts module, tagmap, etc to visualize trends in news.
Are there ways to make non-greedy sql queries and avoid performance breakdowns?
I should add that it won't be a traffic intensive site, just our research team and maybe some clients (let's hope!).

Thanks to everyone who'll want to help me.

Drupal has the features, you need to know the pitfalls

murrayw's picture

Is Drupal a wrong choice?

I wouldn't say so. Drupal has a lot of features that would probably fit nicely with your site. You probably will also have a few issues with a site that big but they are not insurmountable if you do thing sensibly.

I can renounce to Search if it's such a problem (is it enough to disable the module?).

You would need to turn off the standard search and use something like Solr which can handle millions of rows.

I would like to know which other kind of limitations i should expect.

I touch on some of these things here:
http://cruncht.com/93/drupal-troubleshooting
And some of the problems I ran into on my big site are here:
http://cruncht.com/361/uriverse-dbpedia-drupal-case-study#performance-shortcomings

To recap:
- difficult to browse content in the admin section for editing due to a join from node to user table to get back the nodes
- difficult to edit content if some of your nodes have many node references
- avoid teaser/node presentation for long lists in views if your nodes have a lot of cck properties (node references etc) as node load is slow
- the standard taxonomy page as well as the view is unusable with lots of nodes. You can switch Solr in for this.
- getting your url_alias and node indexes into mysql key buffer is crucial or your site will be deathly slow when showing lots of links as well as for simple node loads in any situation
- some modules don't work so great with large amounts of data. Sometimes it is due to dodgy joins, 'or' comparisons or just plain bad db design. Up to you to do due diligence on the modules to make sure they work for your dataset.
- the standard SQL created by views is way too slow for most queries built using the GUI. The main problem is that LEFT JOIN is used in many cases (across to the CCK tables) when an INNER would be fine. You can rewrite the SQL using a hook to fix this though.
- Be careful in general with views and all SQL. Avoiding joins in queries is a biggy as well as making sure proper indexes are being used.

I've read above zeezhao's problem with node import. Could I overcome it using Import/export api module or the import feature of Feeds module?

The way I did it isn't "right" way to do it but I resorted to writing my own SQL and PHP scripts to import the data. In the end I just couldn't be bothered to work out how to programatically create content types with schemas using CCK. Ditto for creating the nodes. However, I don't think using the Drupal mechanisms would have worked anyway owing to the size. Do you really want to suffer node load overhead 3M times? Fast loads with MySQL can get thoughputs of 10K+ rows/s. Using Drupal will obviously be factors slower than this. With 3M nodes it could mean the difference between days/weeks vs months of import time.

During initial load into MySQL make sure you do it from a file directly into the table and make sure the file is on a different disk to the DB. Speedup here is 2 to 3 factors faster than iterating and inserting. Here are some more notes on DB import stuff.

Would i face the same problems of the search module, when i for example click on a news tag to see all the news with the same tag?

Solr is very fast with filtering. I haven't done it with tags though. I'm not sure what consequences that has for the size/performance of the Solr repository. I do remember reading that more indexed fields equals more RAM in Solr. If you have 3M nodes then you could be looking at 1G of RAM (very big guestimate) to handle it. More fields may make this grow. Just flagging this as an issue.

Does the taxonomy module have the same implications of the search module for very large databases?
You can still use taxonomy with your site. Just be aware that you will probably need Solr to do cool stuff (faceted filtering) quickly.

I'm strongly interested in Views2 module too, I would use it together with extensions like charts module, tagmap, etc to visualize trends in news.

I don't know these modules so can't comment.

Are there ways to make non-greedy sql queries and avoid performance breakdowns?

Not sure what you mean by "non-greedy". Views works by firstly getting a list of node ids and then it works out how to present then (node, teaser, fields). For node and teaser a node load is done. This can be slow if nodes are heavy and lookup in DB is slowish. I found that using Fields was a big help. You still suffer two hits to the DB for each node (node and url_alias) but the data that is being loaded is much smaller. This assumes that you have some heavy nodes with lots of CCK.

I should add that it won't be a traffic intensive site, just our research team and maybe some clients (let's hope!).

The other bit of advice I seem to be dolloping out these days is that Boost will help your site a lot and take the pressure off MySQL if things are intense at the DB. Good luck with the traffic.

Managing Director
Morpht

Murray, thank you very much

sanmatteo's picture

Murray,

thank you very much for your long and detailed reply.
Didn't know about Solr, i'll try to integrate it in my system.
As for your suggestion about mysql import, it's good for the first massive import, but for the next days i'll have people updating news day by day, so an import module could work especially for non expert users.

I'm growing a comparison site

morningtime's picture

I'm growing a comparison site for electronics products, Ice Scan Electronics. It's up to 1 million nodes now, but spread accross 10 languages (with i18n) and 9 domains (with domain module). And about 3 million taxonomy terms.

So far I have found there is a single core function that will kill you:

taxonomy_get_tree()

That function simply is NOT suitable for big sites with millions of terms. I causes PHP memory outages. Which means you can't use admin/node/overview (the content list page), I replaced it with Views Bulk Operations' alternative.

Also we had a pathauto issue, because it called taxonomy_get_tree(). We stopped using child-terms in paths, and remove the function.

So far, hosting just needs a lot of RAM. We use Apache Solr as a search replacement, and about 2GB of Memcache to keep the server happy. MySQL is eating about 3-4GB of RAM. Solr has 1GB.

pathauto

mikeytown2's picture

I found that pathauto is the major slowdown.
http://drupal.org/node/202319#comment-2021584
You are correct though, taxonomy_get_tree is apart of the issue.

High performance

Group notifications

This group offers an RSS feed. Or subscribe to these personalized, sitewide feeds:

Hot content this week