Hi all,
I want to build a book review site that will start with millions of nodes -- every book is a node. Meanwhile, logged-in users can create nodes as they like, for example to add new books. I also need to use Organic Groups, and group members will be free to post content (which are also nodes) in the group.
Although I don't have any users yet, I assume that if I eventually get a decent number of users, I might have hundreds of millions of nodes. For example, www.goodreads.com has over 78 million books added, and that number doesn't include group posts, user blogs, and other things that users post.
My question is: is Drupal a viable choice for handling hundreds of millions of nodes?
My PHP skills are weak, so I will need to find a Drupal expert to build my website. But I would like to make sure I won't run into a performance problem at least until, say, I reach 100,000 users and 100,000,000 nodes.
Thank you!
Comments
Millions of nodes
I know Drupal can handle around 5 million nodes, but each node does slow down MySQL, so things could start to get very slow once you hit the 100 million node mark. If you can run a multi-site and split up the books in some logical fashion, that would be a good idea. I know from experience that 100k users will take some thought. The quick fix when dealing with lots of nodes is caching; I know Boost handles millions of nodes, but I don't know how Apache will cope with 100 million HTML files.
Biggest issue is MySQL from my point of view.
Thank you! Does that mean
Thank you! Does that mean Drupal is not suitable for unmoderated community websites? It is not uncommon for a decent community website to have 100,000 registered users. If on average every user creates 20 nodes (content, blogs, open discussion topics), that would already add 2 million nodes. Would that start to slow my site down?
drupal.org
drupal.org has over 700k users, so it's doable; you just need to think about how everything is going to work. First step is to use Pressflow, as this contains most of the patches d.o uses.
I don't think apache & both
I don't think Apache or ext3/ext4 will have a problem with millions of files in the Boost cache, since the files are always divided into many directories based on paths. You might want to take that into account when designing the application and make sure you include something like YYYY/MM/DD in your node paths (using Pathauto), so that the cache files get divided into directories per day.
(I would consider caching at the proxy level with Varnish, putting everything into memory. This requires Pressflow or D7, so that anonymous users get no session.)
I also think this problem is very complex, and you might be forced to use things like MySQL table partitioning or sharding. Drupal 7 might handle this better since it supports MySQL master/slave natively.
Anyway, this seems like a very cool project. I would really like to know what happens when you really put 100 million nodes into Drupal... I should try it sometime :)
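To illustrate the idea of date-based paths, here is a minimal sketch in Python (not Drupal code). The function names, the `cache/example.com` root, and the `.html` suffix are assumptions made up for the example, not the actual layout Boost or Pathauto uses. It only shows how putting YYYY/MM/DD in the path spreads files over one directory per day.

```python
from datetime import datetime, timezone


def dated_node_path(created_ts: int, slug: str) -> str:
    """Build a path like 2010/03/15/some-slug from a Unix timestamp (UTC)."""
    created = datetime.fromtimestamp(created_ts, tz=timezone.utc)
    return f"{created:%Y/%m/%d}/{slug}"


def cache_file_for(path: str, cache_root: str = "cache/example.com") -> str:
    """Hypothetical cache layout: one HTML file per path under a cache root."""
    return f"{cache_root}/{path}.html"


if __name__ == "__main__":
    path = dated_node_path(1268611200, "dune")
    print(path)
    print(cache_file_for(path))
```

With a layout like this, the number of files in any one directory is bounded by the number of nodes created on that day rather than by the total node count.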
Database overload
The issue here is that for each node added, rows get added to several tables (node, node_revisions, etc.).
What can really make a difference is whether or not there are lots of node-related modules on the site.
Each added module (e.g. voting, widgets, etc.) causes extra data to be looked up in the database when a node_load() is done.
The same goes for CCK, since it adds fields to nodes, and tables as well, which need to be looked up when a node is loaded.
So a site with fewer of those modules would be able to handle more nodes than a site with more of them.
Also watch for pager queries, whether they are from core or from views: they can slow down considerably when you have joins to term_node and such.
Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.
Don't use nodes for books?
Would it be better not to use nodes for books? I could store books in other tables, like "fictions" and "non-fictions", instead of putting them into nodes, then find a developer to build book-related operations specifically for my website and hook them up with other parts of Drupal. Would that be much better?
Custom content for speed?
@dontmcyn... I have tried this (putting custom data in a custom table) and it works very well.
Recently I had to build a site with many items (which would usually be entered as nodes) which the client wanted to load via a spreadsheet import. OK, this was only 50,000 rows or so, but bear with me, the method should scale:
I imported this sheet (with many columns, of course) into a custom table with indexes on the specific columns required for lookup, and wrote a simple module to fetch the info from this single table and theme it when a specific path (menu item) is hit.
Speed here simply depends on the underlying DBMS, so make sure MySQL (or whatever you choose) can cope with this. As you suggest, using multiple tables based on some attribute of the content would allow the rows to be split. I guess this would be akin to implementing a very simplistic form of what Oracle calls partitioning (I haven't even looked to see if MySQL can do that).
Clearly you lose some functionality if you plan to use other modules that require nodes to work, so this might not work for you, but it's an interesting approach that can be used if the site you're building is simple enough.
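A rough sketch of that approach, written in Python with SQLite rather than as a Drupal module with MySQL, just to show the shape of it. The `book` table, its columns, the ISBN-style keys, and the sample rows are all made up for illustration; the real site used its own columns.

```python
import sqlite3

# Made-up sample rows standing in for the imported spreadsheet.
SAMPLE_ROWS = [
    ("0000000001", "Sample Title A", "Author A"),
    ("0000000002", "Sample Title B", "Author B"),
]


def build_catalog(rows):
    """Create a single custom table with an index on the lookup column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE book (id INTEGER PRIMARY KEY, isbn TEXT, title TEXT, author TEXT)"
    )
    conn.execute("CREATE INDEX book_isbn_idx ON book (isbn)")
    conn.executemany("INSERT INTO book (isbn, title, author) VALUES (?, ?, ?)", rows)
    return conn


def find_by_isbn(conn, isbn):
    """Fetch one row by its indexed column; returns None when nothing matches."""
    return conn.execute(
        "SELECT title, author FROM book WHERE isbn = ?", (isbn,)
    ).fetchone()


def lookup_uses_index(conn):
    """Ask SQLite for its query plan and check that it mentions our index."""
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT title, author FROM book WHERE isbn = ?", ("x",)
    ).fetchall()
    return any("book_isbn_idx" in row[-1] for row in plan)


if __name__ == "__main__":
    conn = build_catalog(SAMPLE_ROWS)
    print(find_by_isbn(conn, "0000000001"))
    print(lookup_uses_index(conn))
```

The point is that each page hit becomes one indexed lookup against one table, instead of a node load that touches several tables.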
Thank you for sharing your
Thank you for sharing your experience!
For speed, this is simply
This is what's known as table sharding. DB consultants generally recommend it only as a last resort, as it causes as many headaches as it solves.
Keep in mind that there is a halfway point between CCK-style nodes and custom data storage: you can create custom nodes. This is the way everything was done back in the days of Drupal <4.7. That way you can still add things like 5-star, comments, subscriptions, etc., but you manage all the data handling in whatever way is most efficient for your use case.
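To make the trade-off concrete, here is a minimal sketch of splitting rows across tables by an attribute, in Python with SQLite. The table names echo the "fictions" / "non-fictions" idea from earlier in the thread; the function names and schema are hypothetical. Note how any question that spans all books has to visit every table, which is one of the headaches.

```python
import sqlite3

# Hypothetical mapping from a content attribute to the table that stores it.
SHARD_TABLES = {"fiction": "fictions", "non-fiction": "non_fictions"}


def table_for(category):
    """Pick the table for a category; unknown categories are an error."""
    try:
        return SHARD_TABLES[category]
    except KeyError:
        raise ValueError(f"no table for category {category!r}")


def create_shards(conn):
    """Create one identically shaped table per category."""
    for table in SHARD_TABLES.values():
        conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, title TEXT)")


def add_book(conn, category, title):
    """Writes go to exactly one table, chosen by the attribute."""
    conn.execute(f"INSERT INTO {table_for(category)} (title) VALUES (?)", (title,))


def count_all_books(conn):
    """A site-wide query must loop over every table and combine the results."""
    total = 0
    for table in SHARD_TABLES.values():
        total += conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return total


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_shards(conn)
    add_book(conn, "fiction", "Sample Novel")
    add_book(conn, "non-fiction", "Sample Handbook")
    print(count_all_books(conn))
```

Table names are taken only from the fixed mapping, never from user input, which is why formatting them into the SQL string is acceptable in this sketch.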
--
Dave Hansen-Lange
Director of Technical Strategy, Advomatic.com
Pronouns: he/him/his
I'd use nodes
Of course you could get a special module developed that does its own content management, but I'm afraid it would have to duplicate a lot of functionality built into Drupal, while sharing the same problems that can be solved by caching and good database design.
Jochen Lillich, CTO freistilbox Managed Drupal Hosting
Does using more than one MySQL database solve the problem?
Hi,
I just want to know: if I divide my Drupal installation into more than one database, does this increase performance or not?
For example, my current Drupal installation contains 165 tables in one database. If I split this database into 3 or 4 databases with roughly 50 or 60 tables each, will this help?
DIMSKK
Don't think it will help.
Don't think it will help. Besides, it might be hard to get working with Drupal.
Anyway, if your MySQL server is under heavy load, you might want to look into load balancing, e.g. by setting up two servers.
I can't imagine that it would
I can't imagine that it would help, particularly since MySQL uses a client/server architecture. Four databases means potentially four times the network connections… Maybe not too much of an issue if they're just using Unix sockets, but it still sounds like a lot of work for no real benefit.
The Boise Drupal Guy!
In addition to the increase
In addition to the increase in connections, there is also decreased performance of prefixTables() when $db_prefix increases in size. See [#561422] for benchmarks, and a proposed fix that needs to be reviewed.
--
Dave Hansen-Lange
Director of Technical Strategy, Advomatic.com
Pronouns: he/him/his
Errr what am I talking about.
Errr what am I talking about. There would be no increase in connections. With $db_prefix, all tables are accessed via the same connection. The decreased performance of prefixTables() still applies, though.
--
Dave Hansen-Lange
Director of Technical Strategy, Advomatic.com
Pronouns: he/him/his