Scalable static file hosting and some thoughts on Amazon S3

boris mann's picture

So, one of the big problems with serving files through Drupal is that each request requires an entire Drupal bootstrap, which sucks up a lot of system resources. We have work to do in Drupal and the File API to support some interesting scalability options, remote files, and so on.

But, I think there are some interesting models for doing highly scalable static file serving outside of Drupal (and Amazon's S3 is cool).

For the first architecture, consider the following, for the domain example.com:
* 3 (or more) front end web servers with Drupal installed (identical codebase checkouts, call these www1 - N)
* a database backend (could be a cluster or just a big high performance machine, call this db1 - N)
* Drupal files and tmp directories mounted via NFS from a separate machine (can either be a NAS or the static file serving box itself)
* a static file serving machine (call this static1 - N)

To Drupal, all file operations are local -- the NFS mount is configured as the local files and tmp directories in Drupal's settings, so Drupal thinks it's writing to a local file system. So, no changes to Drupal code are required at all.

Now, in actually serving up the files, the front www* servers have Apache mod_rewrite configured to redirect all requests for files (e.g. *.jpg, *.doc, etc.) to static.example.com. The static server doesn't need to run PHP (or even Apache) at all -- it can be completely optimized for serving up static files.

That's it. I don't (yet) have an example of the mod_rewrite rules for this, but I think this would make an interesting "recipe" for how to serve up lots of static files without any Drupal performance penalty. Also: this doesn't HAVE to be used with a cluster -- it can even be used with a single machine: one vhost runs Drupal, and another runs the static file web server. But even a two machine minimum would provide quite a performance increase.
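As a rough illustration of the redirect decision (not actual mod_rewrite rules), here is a minimal Python sketch of what the www* servers would decide for each request. The extension list and the function name `redirect_target` are assumptions made for this example; only static.example.com and the *.jpg / *.doc examples come from the post.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed set of extensions to offload; the post only names *.jpg and *.doc.
STATIC_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".doc", ".pdf", ".css", ".js"}
STATIC_HOST = "static.example.com"  # hostname used in the post


def redirect_target(url):
    """Return the static-host URL for a file request, or None for Drupal pages."""
    parts = urlsplit(url)
    # splitext only looks at the final path segment, so "/files.d/readme"
    # has no extension and stays with Drupal.
    extension = posixpath.splitext(parts.path)[1].lower()
    if extension not in STATIC_EXTENSIONS:
        return None
    query = "?" + parts.query if parts.query else ""
    return f"http://{STATIC_HOST}{parts.path}{query}"


if __name__ == "__main__":
    for example in ("http://www.example.com/files/photo.jpg",
                    "http://www.example.com/node/12"):
        print(example, "->", redirect_target(example))
```

Anything that returns None would continue through the normal Drupal request path; everything else never touches PHP.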

My thoughts on Amazon S3? First, for those that don't know about it, here's the quote from the page describing it:

Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers.

Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers.

So, you get dedicated file storage from Amazon with a web services front end and good bandwidth rates. Also, they have some very cool extra niceties, like you can request any file with ?torrent on the end to get a BitTorrent seed file automatically. A full on Drupal module that did some more interesting things would be exciting, but is not the point of this post...

So, imagine a server-based script that synced those same file and tmp directories to Amazon's S3 service, with the mod_rewrite rules pointing at the Amazon S3 space. Now, you're not even using your own bandwidth or file storage, and you have predictable costs associated with both bandwidth and storage.

However, there is a gotcha with this that needs some work. Basically, it's going to take X amount of time for a script to transfer new files to the S3 service. During that time frame, mod_rewrite needs to NOT forward requests for just the files that haven't been synced yet. So, the sync script probably needs to dynamically update the mod_rewrite rules (insert hand waving here) as each file finishes syncing.
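One way to picture the hand-waved part is a manifest that the sync script updates as each upload finishes, which the redirect logic consults before forwarding. The sketch below is only an assumption about how that could work: `SyncManifest`, `choose_origin`, and the bucket URL are hypothetical names, and in Apache the lookup might be backed by something like a RewriteMap file rather than Python.

```python
class SyncManifest:
    """Tracks which files a (hypothetical) sync script has finished uploading."""

    def __init__(self):
        self._synced = set()

    def mark_synced(self, path):
        # Called by the sync script only after the upload has fully completed.
        self._synced.add(path)

    def is_synced(self, path):
        return path in self._synced


def choose_origin(path, manifest, s3_base="http://s3.amazonaws.com/example-bucket"):
    """Return the S3 URL for a synced file, or None to keep serving it locally."""
    if manifest.is_synced(path):
        return s3_base + path
    return None


if __name__ == "__main__":
    manifest = SyncManifest()
    print(choose_origin("/files/a.jpg", manifest))   # not synced yet
    manifest.mark_synced("/files/a.jpg")
    print(choose_origin("/files/a.jpg", manifest))   # now forwarded to S3
```

The key property is that a file is only ever forwarded after it has been marked, so a request that arrives mid-transfer is still answered from the local copy.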

Comments, questions, offers to make some sample scripts :P ?

Comments

Amazon S3.

dopry's picture

Their price comes to about $75/Mbps, which isn't bad considering you don't have any of the administrative overhead of managing your own infrastructure. Same for the storage capacity...

This is what we need the storage-level drivers for...

So far storage drivers need to implement only a small set of features...
file_exists, file_copy, file_move, file_remove, file_write()...
...
Shouldn't be too terribly hard to do... but with this S3 thing we'd have to figure out some way to transfer files incrementally, and flag when the transfer is completed. (A session or DB record saving a file position would probably do the job. I've also been thinking about this in terms of large uploads... doing multipart file uploads through HTTP POST / AJAX...)
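To make the driver idea concrete, here is a minimal Python sketch of the feature set listed above, with a local-disk driver and a chunked transfer that records its file position so it can resume. This is not the filesystem module's actual API: the class names, the `offset` parameter, and the `state` dictionary (standing in for the session or DB record) are all assumptions for illustration.

```python
import os
import shutil
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """The small feature set from the comment, as an abstract interface."""

    @abstractmethod
    def file_exists(self, path): ...

    @abstractmethod
    def file_copy(self, src, dst): ...

    @abstractmethod
    def file_move(self, src, dst): ...

    @abstractmethod
    def file_remove(self, path): ...

    @abstractmethod
    def file_write(self, path, data, offset=0): ...


class LocalDriver(StorageDriver):
    """Stores files under a root directory on the local disk."""

    def __init__(self, root):
        self.root = root

    def _abs(self, path):
        return os.path.join(self.root, path.lstrip("/"))

    def file_exists(self, path):
        return os.path.exists(self._abs(path))

    def file_copy(self, src, dst):
        shutil.copyfile(self._abs(src), self._abs(dst))

    def file_move(self, src, dst):
        shutil.move(self._abs(src), self._abs(dst))

    def file_remove(self, path):
        os.remove(self._abs(path))

    def file_write(self, path, data, offset=0):
        # Write bytes at the given offset and return the new position.
        target = self._abs(path)
        mode = "r+b" if os.path.exists(target) else "wb"
        with open(target, mode) as handle:
            handle.seek(offset)
            handle.write(data)
        return offset + len(data)


def transfer_incrementally(source_path, driver, dest, state, chunk_size=64 * 1024):
    """Copy in chunks, saving the position in `state` so a later call can resume."""
    position = state.get("pos", 0)
    with open(source_path, "rb") as source:
        source.seek(position)
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            position = driver.file_write(dest, chunk, offset=position)
            state["pos"] = position
    state["complete"] = True  # the "transfer is completed" flag
    return state
```

An S3 driver would implement the same five methods against the web service; callers such as `transfer_incrementally` would not need to change.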

storage drivers...

dopry's picture

They're working.

.darrel.

s3 Libraries

arthurf's picture

You might want to check out: http://blog.apokalyptik.com/Storage3/

I think it'd be pretty simple to set something up to integrate with your driver setup this way, e.g.:

$s3=new storage3($myAccessKeyId, $mySecretAccessKey, $url);
$s3->mkBucket($bucket);
$s3->putFile($file_path, $bucket, $file_name);
$s3->setACL($bucket, $file_name);

print "<img src=\"http://s3.amazonaws.com/" . $bucket . "/" . $file_name . "\" />";

pricing is 8 dollarcent per Mb per second

bertboerland's picture

75 dollars per Megabit per second isn't cheap, in fact it is expensive.

and most of all, it isn't true for S3! :-)

20 c per GigaByte per Month
200 c per Gigabit per Month (yes I know there are only 8 bits in a byte, but it doesn't specify encapsulation like start and stop bits nor who is paying for the packet loss from S3 to client)
204800 cents per Megabits per Month (or)
2048 dollar per Megabit per Month
6606,45 dollar per Megabit per day (/31)
275,26 dollar per Megabit per hour (/24)
4,48 dollar per Megabit per minute (/60)
0,07 dollar per Megabit per second (/60)

hence 7.6 dollarcent per Mbps, and that /is/ cheap. In fact, even if you buy 1 Gigabit and do some opportunity-based costing including routers, BGP maintenance, RIPE/ARIN overhead etc., you won't get under 10 dollars per Mbps. So offloading your data here is a huge scale advantage!

Note that it doesn't specify how much bandwidth is available, so you can't really do this kind of math, nor is the latency specified. Most of all, it doesn't specify if the pricing is half or full duplex!
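As a sanity check on the figures in this thread, here is a small sketch of one way to convert a per-gigabyte price into a monthly cost for a sustained 1 Mbps stream. It assumes the 20 cents per gigabyte quoted above is the transfer price, a 30-day month, and decimal units (1 GB = 1000 MB); none of those assumptions come from Amazon's documentation.

```python
PRICE_PER_GB_USD = 0.20                  # per-gigabyte figure quoted in this thread
SECONDS_PER_MONTH = 30 * 24 * 60 * 60    # assumes a 30-day month


def monthly_cost_of_sustained_mbps(mbps, price_per_gb=PRICE_PER_GB_USD):
    """Cost in dollars of pushing `mbps` megabits per second for a whole month."""
    megabits = mbps * SECONDS_PER_MONTH
    megabytes = megabits / 8             # 8 bits per byte
    gigabytes = megabytes / 1000         # decimal gigabytes
    return gigabytes * price_per_gb


if __name__ == "__main__":
    print(round(monthly_cost_of_sustained_mbps(1), 2))
```

Under these assumptions, a fully used 1 Mbps link moves about 324 GB in a month, which comes to roughly 65 dollars. That is in the same range as the $75/Mbps estimate earlier in the thread, and much more than a few cents.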

--

bert boerland

Hadn't thought about the need to share tmp directory

mfb's picture

I had been thinking that tmp files are used only over the lifetime of one discrete request.. Do they in fact need to be shared across all servers, instead of local to each webserver?

Btw, you could do this the other way around, e.g. one or more frontend webservers without PHP installed; requests for any dynamic pages (anything outside of /themes /files /misc etc.) are proxied (ProxyPass) to one or more backend Drupal webservers with PHP. Again, the Drupal servers have files mounted via NFS from the fileserver so they can see the files.

Using FUSE

Egon Bianchet's picture

A nice idea could be to use a FUSE plugin to mount S3 like a hard disk. Someone is working on it; see this forum thread and the project page.

Not everyone can..

dopry's picture

Not everyone can mount userspace file systems on their servers.. I'd prefer to keep most everything local, except for outrageously large media files.

.darrel.

This has HUGE potential

Veggieryan's picture

http://www.neurofuzzy.net/2006/03/17/amazon-s3-php-class/
http://freshmeat.net/projects/storage3/
http://developer.amazonwebservices.com/connect/entry.jspa?categoryID=47&...

Looks like everyone has the same idea...

... I can just imagine logging into my drupal site, dragging and dropping a few hundred megs of a music project file (10 wavs at 50megs each)...and it appears as if it were on my drupal install.. with NO performance loss???!?!??!???!?

HOLY COW...
Let's set up a donorge for this... how should it be implemented? As a replacement for upload.module? A helper to upload.module? Boris, you are saying you could trick Drupal into thinking S3 was the localhost... THAT would be cool....hmmmm...(gears churning..)

yes... s3 is grrreeat!.
thanks!
ryan.

FileAPI 2.0

boris mann's picture

Ideally, a new FileAPI could support plugins of storage, including remote storage like S3.

It might just be a fork/clone of upload module for now, demoing what a FileAPI 2.0 could do -- handle both local storage as well as remote, with S3 being one of the remote options.

wow...

Veggieryan's picture

I am stoked on this idea
How long would it take you to put up a working prototype?
Can I help test it?

woah..
ryan.

I just handwave

boris mann's picture

I just architect, I don't code (sorry!). I'm also half decent at getting together funding for things. If you are interested, how much would you be willing to donate to get this started?

Well, and what we really need to do is make a separate post on how exactly an Amazon S3 module would work and what it would do. What I described actually had more to do with system-level setup rather than a module.

Here's one approach:

It could be a completely separate node type -- Amazon S3 nodes. You would upload files to your Drupal site and/or select files on the file system where Drupal is installed, and the module would then move them to Amazon S3. Once a file had been copied across, the node would automatically publish and/or change state to indicate that links now work. I would probably use a redirect to the Amazon file so that the permalink would still be your Drupal site.

That would be most appropriate for larger files, and wouldn't necessarily help with, e.g., podcasts and videocasts, since it wouldn't do anything with the upload.module.

Me too..

Veggieryan's picture

I really need to break down and write code instead of hacking it... there just aren't enough hours in the day. By the time I set up a site and visit the issue queue to fix everything, I have already overspent my hours...
I could only throw a few hundred at this right now.
I really think it needs to be a field/widget for CCK... that is where it will be most usable.
I don't think the files should have to go through the localhost... is that really necessary? It seems counter-intuitive. Why put the load on your own server? The whole point of S3 is to offload the work to Amazon...
For the CCK field I would like to have the option of uploading multiple files at once via a Java or Flash drag and drop. This is crucial for normal people who upload many files... they don't like doing one at a time...
I had also outlined the possibilities of multiple file uploads creating multiple related nodes via CCK node relations in this post: http://drupal.org/node/57014
which then led to some work by dopry in CVS that is leading to a unified wrapper for media in CCK... this could all work together nicely... http://cvs.drupal.org/viewcvs/drupal/contributions/modules/filesystem/RE...
This would allow any site to host unlimited media with no performance loss... anyone could compete with the big boys for very little money.
This is my dream. For my pet project, it will allow me to create my own rocket network where musicians collaborate on huge song projects at 500+ MB each without killing my dedicated box.
If anyone wants to get a basic Amazon S3 - CCK filefield module working, let me know. Boris could help with getting funding too!

thanks!
ryan.

Oh how should it be...

dopry's picture

I've got the rough basics in place to get this going... my filesystem.module can now upload a file, manage/unmanage files, move files and has a very simple filebrowser... It should be elementary to abstract the function calls. I've already annotated in the code where system calls need to be changed to use the filesystem_invoke_storage_driver.... just need someone to test some of this stuff and post patches... I'm terrible at testing my code some days, and my bandwidth is limited for working on this at the moment.

Still Needing Testers?

mpare's picture

Dopry, if you are still interested in or needing testers, I am more than willing to help test. I am not the fastest or most advanced Drupalist or programmer, but I can follow. I will do my best to find where the issues lie and help patch.

Peace,

Matthew Pare

Pare Technologies
info at paretech dot com

www.paretech.com

This would be good, I would

gordon's picture

This would be good, I would like the file api to allow for modules to control access.

My case for this is that I would like to turn a piece of content into a product, and then the attached file will not be downloadable unless the user has purchased the product.

--
Gordon Heydon

Filemanager/attachment cheerleader

robertdouglass's picture

I just want to remind readers of this thread that the filemanager/attachment modules already offer downloads that respect the user permissions of the parent node, so any node that you can sell can have private files that are only available to those who've paid.

Yes

boris mann's picture

And the problem with filemanager/attachment is that it needs to be reworked as a patch/patches against core. It's fine (perhaps) for specific web sites, but we want this general functionality in Drupal core.

Is this really true?

Dave Cohen's picture

Is this really true? Remember that when I create a product, the node itself is viewable by all (otherwise, how would they buy it?) But the file associated with the node should be viewable (downloadable) only by those who have paid. In other words, the permissions on the node and the file are not the same.
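The distinction being drawn here, that the node and its attached file need separate permission checks, can be sketched as two independent functions. This is a hypothetical illustration in Python, not how filemanager/attachment or any e-commerce module works; `Product`, `can_view_node`, and `can_download_file` are names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    node_id: int
    file_path: str
    # IDs of users who have paid for this product.
    purchasers: set = field(default_factory=set)


def can_view_node(user_id, product):
    """The product page itself is public, otherwise nobody could buy it."""
    return True


def can_download_file(user_id, product):
    """The attached file is only available to users who have purchased."""
    return user_id in product.purchasers


if __name__ == "__main__":
    song = Product(node_id=1, file_path="files/song.wav")
    print(can_view_node(7, song), can_download_file(7, song))
    song.purchasers.add(7)
    print(can_view_node(7, song), can_download_file(7, song))
```

A file API that lets modules control access would need a hook at the download step that is separate from the node access check.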

Minor problem

robertdouglass's picture

That's an issue of managing the user rights in the desired fashion and making an e-commerce workflow that supports it. The technical problem mentioned is solved, one way or another. The only time I've used e-commerce to sell file access, it was access to a Flash file and I used a role-based permission module with filemanager/attachment, and it worked well.

Could you break down the

Will White's picture

Could you break down the file process into 3 steps?

  1. Upload (FTP, HTTP, etc.)
  2. Storage (Local, remote, S3, etc.)
  3. Delivery (Direct, BitTorrent, etc.)

From the looks of things, there are different solutions for each of those steps and allowing one to mix and match would be excellent.

--
Will White

I have a quick proof of

Will White's picture

I have a quick proof-of-concept S3 module. It allows you to add and browse buckets and files from the administration pages. It could evolve into an API that other modules would extend. I did it in a day so I don't know how bullet-proof it is, but I'll share if anyone is interested.

--
Will White

Does S3 or your module support access control?

Amazon's picture

I assume S3 is a big bucket of storage. Does it have access control that would allow me to store files for each site separately?

Kieran

To seek, to strive, to find, and not to yield

You use buckets to organize

Will White's picture

You use buckets to organize your files. So I suppose each bucket could represent a different site. There is access control although I have not experimented with it yet. I'll start implementing that now.

--
Will White

Plug into filesystem

boris mann's picture

Darrel Opry's filesystem module, which he is aiming to get into Drupal core (perhaps for 4.8) has the concept of multiple file back ends. You can see his design on his site.

Also, a recent patch by Arto Bendiken (see http://drupal.org/node/74472) could enable transfer of files directly from S3 -- click on what looks like a local file, and the file download method is intercepted and the file is returned from S3 instead.

I'll start looking into the

Will White's picture

I'll start looking into the filesystem module. The main issue I'm seeing now is during the upload process. As of now, the module must upload files to the local server first before sending the data to Amazon. It would be great for performance and PHP restrictions if the files could go directly to S3. Others are talking about the same issue on the Amazon Web Services forums but suggestions like using a Flash uploader or a 307 redirect don't really work for us.

--
Will White

boris mann's picture

What I came up with, and what I think is now possible with Arto's filehandler patch, is a Drupal permalink that "returns" the location of the file.

So, it might look like www.domain.com/filesystem/file/1234 and be stored temporarily (optionally: permanently as a "backup") on the local server. Once the file has been side-loaded to S3, then requests for that URL pass back the S3 location.
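The permalink behaviour described above could be sketched like this in Python. It is only an illustration under assumptions: `FileRecord`, `resolve_permalink`, and the returned action labels are invented for the example and are not part of Arto's patch or the filesystem module.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileRecord:
    file_id: int
    local_path: str
    # Filled in once the file has been side-loaded to S3.
    s3_url: Optional[str] = None


def resolve_permalink(record):
    """Decide how to answer a request for /filesystem/file/<file_id>.

    Returns an (action, location) pair: serve the local copy until the
    side-load has finished, then redirect to the S3 location.
    """
    if record.s3_url is not None:
        return ("redirect", record.s3_url)
    return ("serve_local", record.local_path)


if __name__ == "__main__":
    record = FileRecord(file_id=1234, local_path="files/big.wav")
    print(resolve_permalink(record))
    record.s3_url = "http://s3.amazonaws.com/example-bucket/big.wav"
    print(resolve_permalink(record))
```

Because the public URL never changes, the local copy can be deleted or kept as a backup without breaking any links.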

What's the status?

Amazon's picture

I am clicking on a lot of links and see comment activity on this project. It is unclear to me where and who is actively doing development or how we can collaborate to get it working.

Could someone summarize where the patches for S3 support are being implemented, and who the active lead devs are?

Kieran

Check dopry's filesystem module

boris mann's picture

He's putting the finishing touches on an S3 driver for it. See http://cvs.drupal.org/viewcvs/drupal/contributions/modules/filesystem/ -- last commit to S3 about 2 days ago.

Status check-in again

Amazon's picture

Hi, is anyone using S3 in production?

There are a few of us who want to see this done and can support it getting completed.

Kieran

Interview w/ Jeff Bezos

gusaus's picture

This may be interesting to some:

Amazon founder and CEO Jeff Bezos has been talking about their web services business unit a lot lately. Moments after he left the stage at the Web 2.0 Summit last week I was able to speak to him about three of their most recent web service offerings: Mechanical Turk, Simple Storage Service (S3) and Elastic Compute Cloud (EC2). This is a short podcast but you get a glimpse of how important this new business line is to Amazon’s future...

http://www.talkcrunch.com/2006/11/14/interview-with-jeff-bezos/

Gus Austin

Amazon S3!!

chriscm's picture

Amazon S3 looks quite amazing. I'm surprised at the audience that will be using it. I've heard reports of it being highly scalable.


Clearance Rack Support - Toronto Dedicated Servers

Wordpress uses it

Amazon's picture

Wordpress has a plug-in: http://tantannoodles.com/toolkit/wordpress-s3/

Wordpress.com is using it as primary storage, but has a Varnish layer in front of it: http://photomatt.net/2007/10/09/s3-news/

He noted that S3 is not saving him any money, but simplified some of his requirements.

Cheers,
Kieran

Freeware tool to manage S3 for Windows users

cloudberryman's picture

If you are on Windows you can use CloudBerry Explorer for Amazon S3. With its FTP-like client, it makes managing files in S3 easy: http://cloudberrylab.com/ It supports most of the Amazon S3 and CloudFront features, and it is freeware.
