Some more thoughts on Amazon S3 and EC2...

matt@antinomia's picture

Greetings! I don't intend for this to be an advertisement for Amazon, so if folks want to discuss other similar services that's totally cool, and we might even change the name of the group. But, I've been spending (too many) hours researching Amazon Web Services recently and thought some people might be interested to know what it's all about and how they might be able to use this stuff...

Boris made a post last April that captured some interest with regard to using Amazon Simple Storage Service (S3) with Drupal. Since then, Amazon has released the Elastic Compute Cloud (EC2), which allows one to deploy virtual servers (equivalent to "1.7Ghz x86 processor, 1.75GB of RAM, 160GB of local disk, and 250Mb/s of network bandwidth") on the fly in seconds. The services work together, and there is no fee for data transmission between S3 and EC2. Pricing is reasonable for both services, billed at a metered rate determined by your use of resources.
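
To make "metered" concrete, here is a back-of-envelope estimate for a small media site. The prices are my assumption of the launch-era rates (roughly $0.10 per instance-hour, $0.15 per GB-month of S3 storage, $0.20 per GB of internet transfer); check Amazon's pages for current numbers:

```python
# Rough monthly cost for one EC2 instance plus S3-hosted media.
# Prices below are assumed launch-era rates, not authoritative.
instance_hours = 24 * 30   # one instance running all month
storage_gb = 50            # media kept on S3
transfer_gb = 100          # data served to visitors (EC2<->S3 itself is free)

cost = instance_hours * 0.10 + storage_gb * 0.15 + transfer_gb * 0.20
print("monthly estimate: $%.2f" % cost)   # → monthly estimate: $99.50
```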

So, what does this mean? Well, a lot of things... I've thought of a few ways you could use this. I'm sure you can think of more, and that's what this group is all about.

A few references if anybody is interested: First, dopry's fileapi.module, which will theoretically support using S3 as a file mount point (i.e., use S3 instead of the /files directory). Second, a patch I made to the backup.module which allows you to back up your Drupal site regularly, automatically, and effortlessly. This has been invaluable to our company in making sure all of our clients' sites (on various servers and hosts) are regularly backed up. It has some PEAR dependencies (HTTP_Request, Crypt_HMAC) that will likely keep it from going mainstream, but it's interesting nonetheless.
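
For anyone curious what the S3 call behind such a backup looks like, here is a minimal sketch in Python rather than PEAR (illustrative only; the bucket name, key, and credentials are placeholders, and this is the old HMAC-SHA1 signing scheme S3 used at the time):

```python
import base64
import hashlib
import hmac
from email.utils import formatdate

def build_s3_put(bucket, key, access_key, secret_key,
                 content_type="application/octet-stream"):
    """Build the URL and headers for a signed S3 PUT (legacy v2 signing)."""
    date = formatdate(usegmt=True)                 # HTTP-date header S3 expects
    resource = "/%s/%s" % (bucket, key)
    # StringToSign: method, Content-MD5 (empty here), type, date, resource
    string_to_sign = "PUT\n\n%s\n%s\n%s" % (content_type, date, resource)
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    headers = {
        "Date": date,
        "Content-Type": content_type,
        "Authorization": "AWS %s:%s" % (access_key, signature),
    }
    return "https://s3.amazonaws.com%s" % resource, headers

url, headers = build_s3_put("example-backups", "site-backup.tar.gz",
                            "ACCESS_KEY", "SECRET_KEY")
print(url)
```

The actual backup upload would then PUT the file body to that URL with those headers.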

I've also been talking with arthurf about his media_mover.module, which will basically allow YouTube functionality within Drupal (upload videos, conversion to flash video, storage on an offline [S3] filesystem). The one major hangup (it seems) is figuring out how to distribute ffmpeg in a way that's easily accessible to most people. I know little about creating distributable binaries (I freak out when I hear the word 'dependency'). If it's possible, that's awesome. If not, we could use Amazon's EC2 to create a disk image with ffmpeg compiled (and maybe Drupal installed and configured) which would basically allow anybody to create their own YouTube clone for $0.10/hr...

Anyhow, enough rambling... Hope some people are interested in this. :)

Cheers,
Matt

Comments

media_mover! thanks for the link

sime's picture

media_mover! thanks for the link
I see you are a maintainer of it, so just a question of clarification: is it being developed with fileapi.module compatibility in mind, or with its own drivers?

Not yet...

matt@antinomia's picture

I think pretty much everything is currently being developed independently of fileapi, because I don't think S3 is fully implemented in it yet (although it looks like dopry has been working on it). Just a caveat: media_mover won't work as-is at the moment... Arthur put it together somewhat quickly based on code he had written previously, so there are still some bugs to fix.

I think fileapi is the best direction for the S3 stuff to go. But as Douglas points out, adoption might be limited if we have to use the PEAR dependencies (and also require the PHP memory limit to be set higher than shared hosting will permit). I've read there is a way to do it using cURL, but I'm not savvy enough to figure this out...

--
Matt Koglin, Antinomia Solutions

I'm super interested in this!

teamsand's picture

I'm super interested in this!

It will allow the small guys not only to provide services that have been beyond their means, but maybe even to compete with the bigger fish.

There is a very informative article here on how to use EC2 to encode media and S3 for storage: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=691&categoryID=102

Maybe useful for the media_mover.module. I would be very willing to help with testing, supply a server, or even donate to get something like this functional within Drupal.

The one major hangup (it seems) is figuring out how to distribute ffmpeg in a way that's easily accessible to most people

EC2 runs off several images, like Fedora and Ubuntu, that can easily run ffmpeg; several services are already using it for encoding.
The hard part is wrapping all this up into a Drupal module that is somewhat easy to configure. You have a lot of stuff to organize: storage buckets, cron jobs for the encoding, links between S3, EC2, and the Queue service using the activation codes, thumbnail generation, etc. Maybe it's easier than I think it is.

Also, have you looked at this? http://developer.amazonwebservices.com/connect/entry.jspa?externalID=128... I'm new to this service and haven't really looked into anything yet.

For the media_mover.module, I believe it's possible to use EC2, since it accepts HTTP POSTs, and then have it PUT to S3. You pay for the time but not the bandwidth; not viable for storage alone, but very useful for media.

Also, allowing HTTP POST directly to S3 is on their priority list, but it seems to have been sitting there for 8+ months.

Sorry for all the edits; I've just come across this service and am digging into it.

Pear Dependencies

dopry's picture

You can eliminate the PEAR dependencies easily enough. There is an HMAC signing function in the S3 driver for fileapi. There is also one in the OpenID module for reference implementations... I suggested it as an add-in. I may just go ahead and distill it into a crypto.module; I originally proposed it as a crypto.inc, but people didn't think it was necessary for core since nothing used it. ;)

You can also use drupal_http_request in place of the PEAR HTTP_Request. Most of the PEAR objects are memory hogs anyway. I have been having trouble with slow S3 interactions. I'm trying to work around that currently. I may fall back to CURL, but would rather not.

re: ffmpeg. Most people will just have to install it; it has to be compiled for the platform, but an as-needed ffmpeg cluster on EC2 would be nice ;)... You can also look at transformer.module. I need to write some test cases and test the realtime file manipulation. Once that part is stable, I'm adding the priority queue support to queue long-running processes that may need to be run from a separate machine via a daemon.

I would appreciate help in any of these areas, my time is stretched thin. :)

.darrel.

S3 testing

mfredrickson's picture

This is somewhat off topic, but if anyone is writing tools to use S3 and you don't want to pay for the testing phase (or you want to run your own S3-compatible service), check out ParkPlace by the Ruby illuminatus _why.

Explaining Ruby, Camping, or the rest of the app is beyond the scope of this comment, but I thought I'd point it out to y'all.

-M

S3 Module

mpare's picture

Matt, I am relatively new to Drupal, PHP, and Amazon S3. I believe my Drupal and PHP skills are maturing quickly, and I would like to take on a module to contribute back to the community. I think S3 is picking up tremendous speed, and I want to make it easier for others to take full advantage of the system. I would like to take on the responsibility of creating an Amazon S3 media module for Drupal. I may not have a full grasp of S3, but my proposed module would create a 1-to-1 or many-to-1 relationship between Drupal nodes and S3 objects/buckets. The module would let you easily track your buckets and objects, and also provide an easy way to upload, manage, and create your S3 objects/buckets. I have not worked out all the details, but from what I understand this should be relatively easy, and I personally would find it very valuable. Your ideas, corrections, and guidance would be greatly appreciated.

Peace,

Matthew Pare

Pare Technologies
info at paretech dot com

www.paretech.com


yes!

Veggieryan's picture

Please, please, please:
when will we see an S3 module for Drupal?
I can hardly wait.
This will allow any Drupal site to compete with any ol' run-of-the-mill Web 2.0 media storage thingy.

I personally want to build a site for musicians to collaborate on songs. That takes bandwidth.

PM me, as I can contribute testing, input, and $$$.
thanks,
ryan
www.thefractal.org

module progress?

benjaminlhaas's picture

Hello Matthew,

I'm curious, did you end up doing further work on an Amazon S3 media module? I'm looking into the idea of using Amazon S3 storage, and I'd like to see if there are any existing contributions from the Drupal community that I could leverage and/or extend.

All the best,
Ben

No, not really. :(

matt@antinomia's picture

No, not really. :(

--
Matt Koglin, Antinomia Solutions

revive the topic?

benjaminlhaas's picture

Well, I'm not sure if you want to revive this topic... but I did some groundwork on trying to develop a module purely for the purpose of off-loading Drupal image hosting onto Amazon S3. I ran into problems very quickly, though. It seems like Drupal doesn't have a clearly abstracted, separated file-system layer that exists independently of modules (N.B.: I'm probably exaggerating or just misinformed here).

For my purposes, I was developing on Drupal 4.7, and trying to integrate Amazon S3 services with the "Image" and "Node Images" modules. To some extent, I could use Drupal hooks (hook_nodeapi, hook_load, etc.) to store and retrieve the image data on S3 upon node creation and load (I'm glossing over details here - it would involve creating another Drupal database table for associating file ids with S3 file names). But when it's time for a module to display its images, each module calls file_create_url(). Note, it passes in a file path, not a node. The code around these calls needs to be modified independently in each module to be sure that the module looks for the file on S3 first.

I'm not sure how I would ideally want to solve the problem. Should there be an interface layer, between the modules and the file system functions? If so, this interface layer would take nodes as input, and talk to file.inc functions directly. Then this interface could be modified to use S3. But maybe that's too specific and/or unnecessary and/or unwieldy. Should the S3 module somehow be flexible enough to override each module's display function? I just don't know the answers to these questions yet.
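
A rough sketch of what such an interface layer might look like (Python for illustration; the class and method names are invented here, not existing Drupal APIs):

```python
import os
import shutil

class FileBackend:
    """Minimal storage interface modules would talk to instead of raw paths."""
    def put(self, local_path, name):
        raise NotImplementedError
    def url(self, name):
        raise NotImplementedError

class LocalBackend(FileBackend):
    """Today's behavior: files live under the local /files directory."""
    def __init__(self, root, base_url="/files"):
        self.root, self.base_url = root, base_url
    def put(self, local_path, name):
        shutil.copy(local_path, os.path.join(self.root, name))
    def url(self, name):
        return "%s/%s" % (self.base_url, name)

class S3Backend(FileBackend):
    """Stub: a real version would sign and issue PUT requests to S3."""
    def __init__(self, bucket):
        self.bucket = bucket
        self._stored = set()
    def put(self, local_path, name):
        self._stored.add(name)   # network call elided in this sketch
    def url(self, name):
        return "http://s3.amazonaws.com/%s/%s" % (self.bucket, name)
```

file_create_url() would then dispatch through the configured backend's url() instead of assuming a local path, so each module would not need per-module S3 patches.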

Thanks,
Ben

worth reviving, IMHO

Dave Cohen's picture

You're right that Drupal is lacking a useful API for file handling abstraction. I created a module called Upload API, which I use to address this in my own work, and I still hope the community will embrace it. It's for Drupal 5.x, so may be of no use to you.

My contrib module does nothing with S3 at the moment, but it could certainly be extended to do so.

I also have a need to do

dreed47's picture

I also have a need to do image hosting via S3. I'd be interested in looking at your code Ben as it may help to jump-start my efforts. I eventually would like to implement something where users could use their own S3 accounts to store their own images.

I'm also interested in

grah's picture

I'm also interested in this.

Any progress by anyone on image storage to s3?

Yes, I created one for Amazon!

inders's picture

Hi,

I was just working on an S3 module for Drupal and got it working. I used the S3 API code from Google. It's not an image node, but a new content type for image media. Images are uploaded directly to an S3 bucket that is configured and created by the admin. (This is an extendable section, so that we can automatically check the size in code and choose the next bucket for storage, as one is limited to 5 GB.)

You can check the site here:
http://bluishtooth.com/bluweb
Testing on http://www.inderweb.com/
This also contains a file for making the thumbnails available in Views, and it is very flexible for updates.

http://www.indiapoly.com/
Himachal India

S3 versus Youtube and Flickr

Amazon's picture

A quick shortcut is to just host your video and photos on YouTube and Flickr, then include them using the embedded filter module with the TinyMCE media extension.

It doesn't replace generic file storage, but it solves the biggest use cases.

Kieran

To seek, to strive, to find, and not to yield

Support the Drupal installer, install profiles, and module install forms: http://www.youtube.com/watch?v=COg-orloxlY
http://ia310107.us.archive.org/1/items/organicgroups_og2list/dru

subscribing

VenDG's picture

subscribing

Figuring this out as a Drupal module is beyond me, but since I am interested in one and hope someone does create it, I did a Google search on this and discovered:

http://cesarodas.com/2007/09/php-amazon-s3-stream-wrapper.html

http://edoceo.com/creo/phps3tk/

http://www.anyexample.com/programming/php/uploading_files_to_amazon_s3_w...
http://www.anyexample.com/programming/php/downloading_files_from_amazon_...

http://notpopular.com/blogs/josh/2007/10/02/amazon-s3-php-awesome-image-...

There are recent efforts out there of people attempting it with PHP; perhaps the code they release will help with module creation.

Server based solutions

konsumer's picture

I wonder if anyone has explored webserver rewrite-based solutions (most decent webservers have some sort of rewrite capability, and other features of Drupal already require this; friendly URLs, for example).

I envision this:

  1. File is uploaded to the app (for example, an EC2 instance) by the user.
  2. An inotify-based daemon watches for changes. On change, the file is sent to S3 as a public file (using S3 command-line tools, s3sync, PHP functions, whatever) and deleted on the app server; or the folder is an S3 mount, so inotify and the delete are not needed.
  3. A rewrite rule on the app server forwards missing-file requests in that directory to the public S3 HTTP location.

Does this seem sensible?
What are the limitations people see with this approach?

I am currently leaning towards an S3 mount + rewrite based solution.
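
Step 3 might look something like this as an Apache fragment (the bucket name and paths are placeholders, and the exact rule will depend on your layout):

```apache
# Serve /files/* locally when the file exists; otherwise redirect the
# request to the public S3 copy. "example-bucket" is a placeholder.
RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}/files/$1 !-f
RewriteRule ^/?files/(.*)$ http://s3.amazonaws.com/example-bucket/files/$1 [R=302,L]
```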

Hrmm. Close but not quite.

IncrediblyKenzi's picture

We thought about this as well.

We considered an inotify-based daemon to handle a push to S3, but there are several problems with this approach:
- imagecache suffers, since it needs access to local files for processing. dopry suggested that PHP stream wrappers may be an approach (ref: fileapi), and it looks like some basic support for this is slated for D7, but it's complex and would be difficult to backport.
- inotify/s3sync only handles push, not pull, so files that are updated by one webhead won't get picked up by the others. One way to handle this is to run an s3sync scan for changes periodically and do a local update, but you're bound to get stale files.
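
The periodic-scan fallback might look roughly like this sketch, with local directories standing in for the S3 bucket (a real version would list the bucket over HTTP, and the staleness window between runs remains):

```python
import os
import shutil

def sync_newer(src_dir, dst_dir):
    """Copy files from src_dir into dst_dir when missing or newer (by mtime).
    Stands in for a periodic pull from S3 to a webhead's local files dir."""
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        if not os.path.isfile(src):
            continue
        if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
            shutil.copy2(src, dst)   # copy2 preserves mtime for the next compare
            copied.append(name)
    return copied
```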

We've also looked at ocfs2/drbd/clvm as a solution for clustered file systems, and then have one node do a periodic sync to s3 for backups, but we had some issues getting the drbd kernel module to compile against Amazon's 2.6.18 kernel (bad kernel src by default). I'll post more when we get past that headache and onto something interesting.

So there you have it... this one's rough. We've been looking at this for about a month and haven't found a clean solution.

You might have some success with the CDN module (http://drupal.org/projects/cdn). It has support for loading off remote CDNs, and an add-in for s3 wouldn't be that difficult. It does suffer some performance issues, however, so your mileage may vary.

As your alternate solution, s3 mount will be exceptionally slow, but it's worth a shot.

You might have some success

konsumer's picture

You might have some success with the CDN module (http://drupal.org/projects/cdn). It has support for loading off remote CDNs, and an add-in for s3 wouldn't be that difficult. It does suffer some performance issues, however, so your mileage may vary.

That link is dead.

I should also note, before I say anything else, that I am using EC2, so transfers to and from S3 are free for anything in my cluster. This whole line of thinking is crazy for anyone who has to pay for the offloading (beyond the storage space).

As your alternate solution, s3 mount will be exceptionally slow, but it's worth a shot.

Slow is relative. It has less overhead than two disk writes (local, then remote) plus a transfer; it's one disk write and one transfer, all at the transparent userfs level. Reads would definitely be slower, but fewer things would be reading the directory, other than checking whether the file exists "locally", because it's all offloaded to S3 (and served from there directly).

imagecache suffers, since it needs access to local files for processing.

Maybe a mix of solutions would work best. If imagecache is left out of the S3 mount, it would be almost as fast. This way, you could mount S3 on files/images (or files/videos, or even files/s3/) and imagecache would make local copies if they don't exist. Imagecache files would be served locally, and other stuff would come from S3 directly. If the files were missing on a newly started instance, imagecache would just create them on first request. Sure, it would be a bit slower when imagecache reads the original (on S3, mounted locally), but it would only have to do this once.
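
The create-on-first-request behavior described above can be sketched like this (Python for illustration; the function names are invented, and a real transform would be an image operation rather than a byte tweak):

```python
import os

def serve_derivative(cache_dir, s3_mount, name, transform):
    """Return the path to a locally cached derivative, generating it from
    the S3-mounted original on the first request (imagecache-style sketch)."""
    cached = os.path.join(cache_dir, name)
    if not os.path.exists(cached):                # first request: build it
        original = os.path.join(s3_mount, name)   # slow S3 read, happens once
        with open(original, "rb") as f:
            data = f.read()
        with open(cached, "wb") as f:
            f.write(transform(data))
    return cached                                 # later requests serve locally
```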

Another problem I see, though, is the reliance on S3. The recent outage makes me not want to design a system that relies on S3 for its basic function; I'd rather use S3 as a way to get system instances scaling and backing up fast.

Maybe none of this makes sense. I guess I just want to discuss ideas I'm having about solving these issues with other people that are in the same boat.

oops. :)

IncrediblyKenzi's picture

Helps if I paste the right link:

http://drupal.org/project/cdn

It really depends on what your solution is going to look like and what you're trying to get out of using s3.

Most people use s3 with ec2 for:
- data persistence (e.g. backup).. If your instance goes down, you can restore files from s3.
- offloading static files (e.g. a ghetto CDN). This will allow apache to focus on delivery of generated content only. Serving from multiple buckets may be interesting as well, since it allows the browser to pipeline requests more effectively.

I took a first pass at the S3 saving with CDN module integration. The notes and the (very rough) code for it are below; the imagecache/private-files issues were blockers for it, so we elected to move down a different route, but you're welcome to have a go and see if it makes sense for what you're doing:

http://www.workhabit.org/s3saver-notes-sacramento-drupal-users-group

Most people use s3 with ec2

konsumer's picture

Most people use s3 with ec2 for:
- data persistence (e.g. backup).. If your instance goes down, you can restore files from s3.

Definitely. I am using Scalr, so this part is pretty transparent, other than the files section. MySQL is backed up automatically using replication techniques and periodic S3 syncs, and I have base images that transparently restore file persistence from S3 on load. For my purposes, I just need a solid way to synchronize static but changing file data, which should live in "files/images/" (for the most part).

offloading static files (e.g. a ghetto CDN). This will allow apache to focus on delivery of generated content only.

This is the part I'm most interested in. I'd like S3 to handle all the scaling for me. I'll let you know what my results of an apache/mount based solution are. I think I will just use a local imagecache for simplicity.

Thanks for your thoughts.

more thoughts

konsumer's picture

Ok, so I just came across this thread again, and realized that I hadn't posted any updates.

I used s3fs+mod_rewrite with great results. Much faster, and the overall load on my server was very small. As for speed, it was pretty fast both ways. S3 and EC2 instances were in the same zone, so it really was not an issue.

My basic methodology was this: reroute all access to "files" to S3. I mounted files as s3fs, and just wrote to it directly (well, over the s3fs mount), even in the case of imagecache.
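
For anyone asking how: mounting a bucket with s3fs looks roughly like the following (the bucket name and paths are placeholders, and options vary between s3fs versions):

```shell
# Credentials go in a passwd file as ACCESS_KEY:SECRET_KEY
echo 'ACCESS_KEY:SECRET_KEY' > /etc/passwd-s3fs
chmod 600 /etc/passwd-s3fs

# Mount the bucket over Drupal's files directory
s3fs example-bucket /var/www/html/files -o passwd_file=/etc/passwd-s3fs -o allow_other
```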

i am very interested

drupalninja99's picture

Do you have any links on how to accomplish this?

Follow me on twitter: @drupalninja
