Comparison of shared files directory solutions

If you want to run multiple webservers for a Drupal site, you need to keep the code and the files directory in sync across all of the webservers.

  • For synchronizing the code, deployment tools like Capistrano, drush deploy, Fabric, and Phing all work well. Sharing the code over a network file system usually doesn't perform well because the files are re-read (or at least stat'd) during page load, which is slow.
  • The files directory generally contains important dynamic assets that must be synced across the servers, such as user-uploaded files and the aggregated JavaScript/CSS files. For these, some network file system or syncing technology is necessary (a minimal settings.php sketch follows this list).
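
One common pattern for that second point is to keep the codebase local to every webserver and point only the files directory at shared storage. Below is a minimal Drupal 7 settings.php sketch, assuming the shared storage is mounted at /mnt/shared-files on each box; that mount point and the symlink it implies are illustrative, not a standard:

    <?php
    // settings.php sketch (Drupal 7): code stays local, files live on shared storage.
    // Assumption: every webserver mounts the same NFS/GlusterFS/EFS export at
    // /mnt/shared-files, and sites/default/files is a symlink into that mount.
    $conf['file_public_path']    = 'sites/default/files';        // symlink -> /mnt/shared-files/public
    $conf['file_private_path']   = '/mnt/shared-files/private';  // private files directly on the mount
    $conf['file_temporary_path'] = '/tmp';                       // keep temp files on fast local disk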

Previous discussions:

There have been a few discussions on this topic over the years.

  • November 2008: Load Balanced Servers Questions - Suggestions focus on NFS plus symlinking the sites directory.
  • March 2011: Best practice shared file system - NFSv3 is working but they are hitting some limitations. Considering GlusterFS, Mogile, NFSv4, Lustre - most commenters recommended GlusterFS.
  • December 2011: GlusterFS and Drupal 7 horizontal scaling - Suggestions include: keep the code local but share the files directory; use NFS, lsyncd, sshfs/FUSE, or DRBD for the files directory; and try to offload the files directory to Varnish or a CDN so that each request for a file isn't hitting the shared directory.
  • March 2013: Shared DocRoot For Redundant Web Servers - The idea was to put the Drupal root in S3 instead of using NFS. One commenter disliked the idea.
  • November 2013: The Drupal module S3 File System adds the ability to store uploaded files in an Amazon S3 bucket by adding a new Drupal filesystem alongside public and private.
  • June 2014: The Drupal module Storage API adds the ability to store uploaded files in multiple containers, such as the local filesystem (which can be combined with other mounted directories), FTP, S3, and the database. Since it is an API, other contrib modules extend it with additional backends. Storage API adds stream wrappers and bridges to core file and image fields (including image styles). Because files can be stored in multiple containers, various configurations are possible: storing uploaded files locally first and migrating them to multiple storage backends, providing backups and failover, or adding capacity by uploading new files only to a new server marked for population while still serving existing files from the old backends.

Comments on technology choices:

  • GlusterFS: Used by Acquia and several other companies.
  • NFS: Seems to be a commonly recommended solution. Drupal.org uses NFS.
  • lsyncd: Works well if you are comfortable with Lua.
  • rsync in cron plus sticky sessions on the load balancer: Good if you are comfortable with rsync, but may not work well for aggregated JS/CSS unless those are served from a CDN.

Comments

We've mostly used NFS with

Mark Theunissen

We've mostly used NFS with success; it just adds a single point of failure.

Pantheon uses their own custom system called Valhalla: https://www.getpantheon.com/blog/why-is-high-availability-critical-for-D...

I think NFS is the best

erikwebb

I think NFS is the best "common choice," mainly because every sysadmin knows how to set it up. Tools like GlusterFS are great, but they really require some new knowledge. When the dev team is not also the hosting team, I lean toward a more supportable solution.

Take a look at Maginatics MagFS

braimond

It is not open source, but it is extremely simple to set up and uses S3 as a backend (with the advantage of not using fixed-size volumes like EBS). It often achieves better performance than NFS for in-cloud deployments, thanks to its sophisticated caching and deduplication algorithms. With a single compute node dedicated to the storage cluster (two or more in an HA configuration), it can serve hundreds to thousands of compute nodes, whether those are web servers, life-sciences computational nodes, or rendering servers. Today some of the largest AWS customers are using our technology in production for the exact use case described in this thread. Feel free to drop me a note at braimond@maginatics.com to discuss further, if interested.

Looks interesting, how much

mdekkers

Looks interesting, how much does it cost?

Local Code Issues

nerdcore

One of the suggestions brought up among a few of us sysadmin types during DrupalCon Portland, and agreed by many to be a good solution, is to place the bulk of the PHP code on the web servers directly (on their own EBS volumes) and share the sites/ directory or sites/*/files/ directories through GlusterFS or NFS (though I've heard bad things about GlusterFS performance).

Here are my issues which I would love to get more feedback on:

  1. Much of the code for our Drupal sites resides in sites/all/modules/, so although placing Drupal core on each web server is fairly straightforward, if I share the entire sites/ directory we are back to the case where much of the code is being executed across a network share. If instead I were to NFS-export each sites/* folder, that would create extra management work when adding new sites or migrating sites between servers. (See the multi-site settings.php sketch after this list for one way around this.)

  2. When I tried using NFS for the entire DocumentRoot (Drupal core, all contrib modules, and all sites/*/ directories) it totally blew up in my face: "Stale NFS file handle" errors on all web servers after very little uptime, and most of the solutions I came across for this error were "remount the filesystem (by hand)" or, worse yet, "restart the NFS server service (by hand)". None of this is acceptable in a live server environment, but I am very eager to address these issues and have some files shared across servers.

  3. I am intrigued by the symlink suggestion in http://groups.drupal.org/node/16474 and would like to follow up on this idea now in 2013. Anyone using this today?
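
On point 1, one way to avoid exporting each sites/* folder individually is to export a single shared mount and have each site's settings.php point its files paths into a per-site subdirectory of that mount, while the code in sites/all/modules/ stays local. A hedged Drupal 7 sketch; the /mnt/shared-files path and directory layout are illustrative:

    <?php
    // Shared snippet for each site's settings.php in a multi-site install.
    // Only /mnt/shared-files needs to be exported and mounted; code stays local.
    $site = basename(conf_path());  // e.g. 'example.com' for sites/example.com
    $conf['file_public_path']  = 'sites/' . $site . '/files';  // symlink -> /mnt/shared-files/public/<site>
    $conf['file_private_path'] = '/mnt/shared-files/private/' . $site;

The per-site symlinks still need creating once, but adding a new site then only means a new directory on the shared mount rather than a new NFS export.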

Used NFS in Production with acceptable results

mikepvd

Environment:
Amazon EC2 - EBS-backed instances behind one ELB
Single file server (NOT a webhead) with up to four webheads
NFSv4 sharing two paths
Webheads running APC with stat disabled (apc.stat=0), Ubuntu 12.04 LTS, 32-bit
Apache.. yadda yadda
CloudFront CDN caching

When I first set this up I had a few of the stale file handle issues that the poster mentioned; it may have been related to not having the idmap domains set up properly. After a few days that issue went away. What I did have a problem with was NFS leaking memory on the host. A patch near the beginning of the year fixed that issue.

This setup served the code (entire Drupal folders) as one share and the files as another. I had the files for all the sites symlinked into a folder outside the document root.

This setup serves approximately 750k to 1 million hits a month of combined user and bot traffic across 8 sites. What probably helps is that the webheads only need to pull the code files once across NFS unless Apache gets restarted (code deployment), and CloudFront caches most static files for at least 24-48 hours, so file access is mostly limited to saving uploaded assets (no user content), image-cache derivative creation, and style updates (CSS/JS cache clears).

If you have your codebase

coredumperror

If you have your codebase hosting handled, but need a way to share just uploaded files, you might like S3 File System. It adds a new filesystem alongside public and private, which stores the files in an S3 bucket. The module caches the metadata for all the files in that bucket in the DB, making stat()-like filesystem calls much more performant than some other solutions.

It's still under very active development, so if you're interested in support for any additional features, please come by the issue queue and say hi.
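
If you go this route, the main site-level switch is Drupal 7's default upload scheme. A hedged settings.php sketch, assuming the module registers an s3:// stream wrapper; the exact s3fs_* variable names and the AWS credential setup should be checked against the module's README rather than taken from here:

    <?php
    // settings.php sketch: make newly uploaded files default to the S3-backed
    // scheme instead of public://. Assumes the module registers an s3:// wrapper;
    // bucket and credential settings are configured per the module's own docs.
    $conf['file_default_scheme'] = 's3';

Individual file and image fields can also be switched to the new scheme per field in the field settings, so a site can migrate gradually instead of all at once.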

Background writes

mikeytown2

If you want writes for things like CSS/JS aggregates and image styles to happen in the background, my two modules can accomplish this:
https://www.drupal.org/project/imageinfo_cache
https://www.drupal.org/project/advagg

hook_image_imageinfo_cache_save($image, $destination, $return) can be used to mirror image styles. The module will also pre-generate image styles and can be used to lock down on-demand image style generation so that only the pre-generated derivatives can be served (note that the original image is still accessible on a public file system). It also comes with a drush command to help with image style generation. This module will additionally cache the internal parts of image_get_info() so no file I/O is done when it is called (this needs the core patch at https://www.drupal.org/node/2289493 because of the filesize() call).
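
As a rough illustration of the mirroring idea, a hook implementation along these lines could copy each freshly generated derivative onto a second mount. The parameter semantics are assumed from their names (treating $destination as the stream URI of the derivative just written), and the mymodule prefix and /mnt/mirror path are placeholders; check the module's API docs for the real contract:

    <?php
    /**
     * Implements hook_image_imageinfo_cache_save().
     *
     * Sketch: mirror each generated image derivative onto a second mount so
     * another webhead (or a failover box) has a copy without regenerating it.
     */
    function mymodule_image_imageinfo_cache_save($image, $destination, $return) {
      // Assumption: $destination is a stream URI such as public://styles/thumb/foo.jpg.
      $mirror = '/mnt/mirror/' . file_uri_target($destination);
      $dir = drupal_dirname($mirror);
      if (file_prepare_directory($dir, FILE_CREATE_DIRECTORY)) {
        copy(drupal_realpath($destination), $mirror);
      }
    }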

hook_advagg_save_aggregate_alter($files_to_save, $aggregate_settings, $other_parameters) can be used to mirror CSS & JS aggregates (also see https://www.drupal.org/node/2143913#comment-8273745). If your file system is slow you can set $conf['advagg_fast_filesystem'] = FALSE; so that checks such as file-exists are done in a background thread if HTTPRL is installed. Also note that if AdvAgg's aggressive cache is used, zero file I/O is done by PHP for CSS & JS.
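
A hedged sketch of the aggregate-mirroring idea: since this is an alter hook, one plausible use (following the linked comment, which is about pushing aggregates to S3) is to add a second destination entry for each aggregate so AdvAgg writes it to another stream wrapper as well. The array structure assumed here (keyed by destination URI) and the s3:// scheme are assumptions; advagg.api.php in the module documents the real parameters:

    <?php
    /**
     * Implements hook_advagg_save_aggregate_alter().
     *
     * Sketch: ask AdvAgg to also write each CSS/JS aggregate to a mirror
     * location by adding a second entry that points at another stream wrapper.
     */
    function mymodule_advagg_save_aggregate_alter(array &$files_to_save, array $aggregate_settings, array $other_parameters) {
      // Snapshot the current keys so newly added mirror entries are not re-processed.
      foreach (array_keys($files_to_save) as $uri) {
        // Assumption: keys are destination URIs such as public://advagg_css/....
        $mirror_uri = str_replace('public://', 's3://', $uri);
        if ($mirror_uri !== $uri && !isset($files_to_save[$mirror_uri])) {
          $files_to_save[$mirror_uri] = $files_to_save[$uri];
        }
      }
    }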

I would avoid glusterfs

jjozwik

I would avoid GlusterFS mounts for Drupal hosting, other than the site folders. Mounting a GlusterFS share via NFS loses the automatic failover, but it will yield much better performance.

Drupal code relies heavily on file_exists(), which cannot be cached in APC, so performance on a clustered file system is very slow. When GlusterFS gets a file_exists() check it has to verify that the file exists with the quorum.

I originally planned on deploying Drupal over GlusterFS, but after extensive testing it was just too slow.
GlusterFS is awesome, but not a good fit for Drupal code.
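
To make the cost visible, a tiny standalone benchmark along these lines (the two paths are illustrative) compares file_exists() latency on local disk versus a Gluster or NFS mount; the per-call difference adds up quickly when a Drupal bootstrap performs many such checks:

    <?php
    // Micro-benchmark: time repeated file_exists() calls on a local path and on
    // a network mount. clearstatcache() stops PHP's stat cache from hiding the
    // network round-trips that make clustered filesystems feel slow.
    function time_file_exists($path, $iterations = 1000) {
      $start = microtime(TRUE);
      for ($i = 0; $i < $iterations; $i++) {
        clearstatcache();
        file_exists($path);
      }
      return microtime(TRUE) - $start;
    }
    printf("local:   %.4f s\n", time_file_exists('/var/www/html/index.php'));
    printf("gluster: %.4f s\n", time_file_exists('/mnt/gluster/index.php'));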

Amazon EFS

dotsam

For those deploying on Amazon Web Services, it looks like Amazon Elastic File System will work really nicely for keeping multiple Drupal webservers in sync.

https://aws.amazon.com/efs/

"Amazon Elastic File System (Amazon EFS) is a file storage service for Amazon Elastic Compute Cloud (Amazon EC2) instances. Amazon EFS is easy to use and provides a simple interface that allows you to create and configure file systems quickly and easily. With Amazon EFS, storage capacity is elastic, growing and shrinking automatically as you add and remove files, so your applications have the storage they need, when they need it."

"Amazon EFS supports the Network File System version 4 (NFSv4) protocol, so the applications and tools that you use today work seamlessly with Amazon EFS."

Thanks for posting this!

greggles

I saw that too and immediately thought about using it for the files directory. For what it's worth, we were using NFS for a while and then switched to Gluster. Gluster required more RAM on all the servers, but we moved to it because it was seen as more scalable for additional webservers.

I'm really interested in

jackfoust

I'm really interested in this, but as of Nov 28 it is still in preview and only available in the Oregon region, and only if they accept you into the preview =(

EFS is now generally

greggles

EFS is now generally available, but still in limited regions.

There's an article from Metal Toad comparing Gluster, EFS and SoftNAS.

Anyone have more real world experience?