Separate Old Content

Events happening in the community are now at Drupal community events on www.drupal.org.
kmoll's picture

Hello, I have been developing with Drupal for a while now, but new to the high performance side of things. I am working with a site with about 105,000 nodes and keeps growing at a fast rate. But most of the nodes get put in an archived state after a week or so. So most of the content on the site is very up to date, but we want to keep the old content because search engines have indexed them, and our authors and content providers look up their old content to reference in new articles. But the views for the site filter out the content based on the value of that archived field. So they have to search and filter through over 100,000 records just to find the few records they are pull up for the site. We have many levels of load balancing and caching on the site (with boost) to speed things up, but there is one section of the site that needs to be rebuilt when at certain times when new content is added. In our environment, we have huge spikes in traffic during certain events. During these events our content authors are usually posting articles and blogs which triggers these sections to be rebuilt, while we get huge spikes in our visitor traffic, and while we are under heavy load these queries are taking a very long time to execute.

I have been doing a lot of research and reading through many of the posts on this group, and have gotten some great information. I am wondering if there is a way to export these archived nodes to either new tables or a new database, so that all of the queries run on the site will only filter through the active content, while to few situations where the old content needs to be accessed it is still available. I have thought about serving all this content as static html files as I read in one of the posts here, but that won't allow me to have all the menus and blocks updated with the new styles and content.

I am figuring some other organizations, especially news organizations must have run into this problem before and I am wonder what people have done to get around this problem. One thought I had was creating another instance of drupal where I would create a migration path to copy all non-node specific data and only the node specific data for the archived nodes and changing the path for those nodes, and setting the load balancer to serve those paths from the other instance. This would mess up the indexing for search engines, but we would update our site map to hopefully account for it. Not ideal SEO wise, but just a thought to try to limit the number of records each query has to run through.

I know I have alot of query optimization to do, but I think if I cut the number of records that get searched and filtered by 90% then I could accurately asses my queries to determine how to further optimize them.

Any thoughts or suggestions would be greatly appreciated,

Thanks,

Comments

I think doing some query

dalin's picture

I think doing some query optimization and database tuning would be a heckufalot less work than any of the other options that you are proposing.

Also you might consider using Varnish instead of Boost. You could write a bit of custom code that would set really high max-age for the archived pages.

Boost may also be the cause of your high loads when new content is created.

--


Dave Hansen-Lange
Director of Technical Strategy, Advomatic.com
Pronouns: he/him/his

I agree that that query

moshe weitzman's picture

I agree that that query optimization is your best bet here. At some point, your page caches need to clear for new content. And you have to be able to sustain load while that happens. You can certainly work on caches lasting longer (making sure cache_clear_all() does not blow everything away by enabling min lifetime), but the above fact remains.

This might sound too easy,

fabianx's picture

This might sound too easy, but have you looked at the explain of the query or can you even share one?

If its just the large range-set it needs to look at then you could perhaps just hack the queries to be restricted by the date. While that does not absolutely remove the large table penalty it helps if the fields are properly indexed.

So adding a hook_db_rewrite_sql and adding a condition like:

$return['where'] = 'n.created > $archived_item_timestamp;
return $return;

for all wanted queries, but the search function (should use SOLR for that anyway), could perhaps speed up those queries quite a lot and have those nodes be "archived" for all purposes but viewing or editing ...

Also adding slave DBs for reads could perhaps help, if its the reads that are slow.

It it is the writes, you might consider caching those regions in memcache instead ...

Best Wishes,

Fabian

I agree

kmoll's picture

I do agree that I need to go through all the queries, and find the troubled ones, and check the indexes. I have been researching alot of those techniques and I am sure they will help out a quite a bit. I'm not a mysql expert, but hopefully I am becoming one at a quick pace. But the theory I can't quite let go is the fact that the our usable data set remains constant, every day new articles are added, old ones are archived. So over time our table just grows and grows. Even with optimizing queries and making sure indexes are created properly, the amount of data that still has to be searched through is growing exponentially. I feel that the best way to optimized them is to keep the data set that is searched as small as possible. We have done tests with the queries on just the size of the data set that is "active" and the queries took a fraction of the time. To me the biggest performance gain we can get is limiting that data set.

Maybe I'm missing the mark here. Again, I do plan to scrutinize most queries and take every effort to tune the database server for optimal performance, but as time goes and we keep growing I feel that even optimized queries are still going to get slower over time with the amount of content that is added. I know there are some news organizations out their that use drupal, and I am assuming they face the same problem. Maybe not. I do appreciate the quick responses, and I plan on using your suggestions to make our site as fast as possible.

The whole point of database

moshe weitzman's picture

The whole point of database indexes is that they are 'precomputing' the data for filters and sorts such that they are fast no matter how much data is in the table. Table size only matters when you are dealing with tables without proper indices or with queries that can't use indices for various reasons.

Also, Drupal simply is not engineered to do what you describe. There is no place to put archived content.

yep

catch's picture

Also, trying to store old content completely separately, is close to what is usually described as 'sharding'.

Some document databases have that feature, I saw a MongoDB developer's blog post about sharding where the first line in big letters was "don't use it". It should not be necessary to even think about things like that until you have hundreds of millions or billions of items to store.

MySQL also supports sharding.

dalin's picture

MySQL also supports sharding. But agreed, it is doubtful that the OP's DB is so large that it would even see any performance gain from sharding.

--


Dave Hansen-Lange
Director of Technical Strategy, Advomatic.com
Pronouns: he/him/his

I would investigate solr

micheas's picture

If you can get search_api to work with your site, using solr (on the default lucene search engine, or whatever other nosql database you prefer) would solve your problem.

With 300,000 nodes search_api_solr with solr running on tomcat on a virtual machine with 512M ram. the searches are not noticeably different than loading any given node.

I am basing this suggestion based on the assumption that searching is the only performance problem you are having.

apachesolr_search is more functional today, but the long term future is clearly search_api. Getting search_api_solr, and apachesolr_search to play nicely with each other is a bit of a challenge, with a lot of documented gotcha's (read the bug reports, there are workarounds for the problems you will have if you try and use both of them on the same site.)

Hi micheas, we do use solr

kmoll's picture

Hi micheas, we do use solr for searching. I have done alot of investigating and all these guys were correct, most of the performance issues lie in queries taking to long, not necessarily the amount of data in the database. So I am trying to do some massive query optimization as well as find some other bottle necks that could occur in the way the code is written. Hopefully we will see some huge performance gains. Thanks everybody for your help and advise.