Amazon S3 Integration

I have been asked to integrate Amazon's S3 file store into Elgg. It's part of a larger project to build an AMI for Elgg 1.8. We're thinking the cache will remain in the standard file store, but all file uploads, group icons, and profile avatars have to move to S3. For the database, we're going to migrate to Amazon's RDS. The hope is that with these changes we should be able to spin up additional Elgg instances as demand increases. Any thoughts / experiences are appreciated.

To start, I have created an AmazonS3 plugin to provide basic S3 functionality, and I have modified Elgg's file plugin to use S3. I did not modify or replace the Elgg file plugin class. Instead I chose to mirror the local file structure on S3 (all uploaded files and any thumbnails created when image files are uploaded). I then migrated URLs to point to the S3 files rather than the local files. I use the same prefix and user file matrix to mirror the directory path for each user. It seems to work well.
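
Not the plugin's actual code, but a minimal sketch of that mirroring idea - assuming the AWS SDK for PHP (a newer SDK than was available at the time) and a placeholder bucket:

```php
<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

/**
 * Mirror a locally stored Elgg file to S3 under the same relative path,
 * so the S3 key reproduces the data-root layout (prefix + user file matrix).
 */
function amazons3_mirror_file(ElggFile $file, S3Client $s3, $bucket) {
    $local = $file->getFilenameOnFilestore();      // absolute path on disk
    $dataroot = elgg_get_config('dataroot');       // Elgg's data directory
    $key = ltrim(substr($local, strlen($dataroot)), '/');  // matrix path becomes the key

    $s3->putObject(array(
        'Bucket'     => $bucket,
        'Key'        => $key,
        'SourceFile' => $local,
    ));

    return $key;   // callers build the S3 URL from this
}
```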

Today's task is to migrate group icons and user avatars to S3. It seems like a straightforward task. I notice Elgg uses icontime to cue the browser on when to use cached copies of icons - I presume for performance reasons. Icontime is still being updated, but since I am returning URLs to S3 thumbnails, icontime is not used (thumbnail.php is never called). Nor am I using download.php - instead the download button (title menu) uses the Amazon S3 URL.
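
For anyone following along, a hedged sketch of how bypassing thumbnail.php can work: the 'entity:icon:url' plugin hook is core Elgg 1.8, while the bucket name and key layout below are made up for illustration.

```php
<?php
// Core Elgg 1.8 hook; the handler returns the avatar URL for a given size.
elgg_register_plugin_hook_handler('entity:icon:url', 'user', 'amazons3_avatar_url');

function amazons3_avatar_url($hook, $type, $url, $params) {
    $user = $params['entity'];
    $size = $params['size'];   // e.g. 'topbar', 'tiny', 'small', 'medium', 'large'

    if (!$user->icontime) {
        return null;           // no avatar uploaded - fall back to Elgg's default
    }

    // Thumbnails are mirrored to S3, so hand the browser the S3 URL directly
    // and thumbnail.php never runs.
    return "http://my-elgg-bucket.s3.amazonaws.com/avatars/{$user->guid}/{$size}.jpg";
}
```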

Two concerns. One, am I missing something? This seems to have gone too easily. And two, I'm wondering if I should have built an AmazonS3Filestore class instead of manually replacing local filestore calls with Amazon S3 calls. I chose not to because the abstracted ElggFilestore.php is a bit too fine-grained to easily slip in Amazon S3 (seek, read, open, etc.). Doing so, I think, would require writing S3 files to a temp file any time one is accessed - which seems unnecessary given how easy the first approach has been...
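
To make that trade-off concrete, this is roughly the shim the rejected approach would force - staging every object in a temp file so ElggFilestore's seek/read/eof contract can be met. The S3 download helper is hypothetical.

```php
<?php
// The crux of the rejected approach: ElggFilestore hands out a handle that
// must support seek()/read()/eof(), so an S3-backed implementation ends up
// staging every object in a temp file on each open().
// amazons3_download_to_file() is a hypothetical GET wrapper.
function amazons3_open($bucket, $key, $mode) {
    $tmp = tempnam(sys_get_temp_dir(), 's3');
    amazons3_download_to_file($bucket, $key, $tmp);
    return fopen($tmp, $mode);   // seek/read/eof now work on the local copy
}
```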

Your thoughts and comments are appreciated. I'd rather find problems now, in the prototype stage, than have to change my approach later.

  • Thought I'd drop a note with an update - though it appears no one is listening. I have ported user avatars, group icons, and the file mod all to use Amazon S3 for storage. At this point nothing is being written to the local file store except cache files.

    Thinking of tackling the switch to Amazon RDS tomorrow.

    Would love to get an outside view / opinion on this. Will this slow down Elgg, speed it up, or be a neutral mod?

  • don't let appearances fool you. i am watching this. i'd be curious re: an academic 'study' / review of what you are doing/going thru to get these aspects accomplished. wonder if you wanna write up some notes to guide others for doing same @ S3.

    as soon as elgg's data accesses go via any other route - elgg loses some workload - that by itself speeds up the elgg site. other factors may come into the perf picture -- eg. the perf/speeds of the 3rd party (data|db) wrappers.

    are you working with the elgg core code @mods or adding 'plugin functionalities' or patching via htaccess/apache arena/... ?

    all this amazon 'tech' is mostly new stuff for me - i know the underlying technologies, just have not actually played much with the AmZ implementations and the API..

  • My only concern is that S3 URLs will of course not require any Elgg auth, so you may want to only sync ACCESS_PUBLIC files to S3, and make sure to remove them if they're deleted/change access level. In Elgg out of the box I think most profile pics are public, but personal files definitely may not be; dunno about private groups.

    If files remain local for stat calls (i.e. is_file doesn't trigger an S3 API call!), I think this shouldn't slow down much. Done any profiling?
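
    Sketched out, that access gating might look like the following - ACCESS_PUBLIC and the 'update' event are core Elgg; the PutObjectAcl wrapper is hypothetical:

    ```php
    <?php
    // Map Elgg's access level onto an S3 ACL (ACCESS_PUBLIC is core Elgg).
    function amazons3_acl_for(ElggFile $file) {
        return ($file->access_id == ACCESS_PUBLIC) ? 'public-read' : 'private';
    }

    // Re-check the ACL whenever a file entity is saved, so access changes propagate.
    elgg_register_event_handler('update', 'object', 'amazons3_sync_acl');

    function amazons3_sync_acl($event, $type, $object) {
        if ($object instanceof ElggFile) {
            amazons3_put_acl($object, amazons3_acl_for($object));  // hypothetical PutObjectAcl wrapper
        }
        return true;
    }
    ```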

  • @Dhrup - The goal of the project is to have an Elgg AMI (Amazon Machine Image) that can be replicated as necessary for scalability. I am going to use Amazon's Route 53 for load balancing. For this to work, I need a common database, which will be done using Amazon's RDS. Using the Elgg file store locally won't work, so I've moved all file storage to S3 with the exception of cache files. Those, I'm thinking, will be OK on the local file store for each instance.
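
    For context, the shared-database piece is just every instance's settings.php pointing at the same RDS endpoint - a minimal excerpt, with a made-up hostname:

    ```php
    <?php
    // settings.php excerpt - every instance shares one database.
    // The endpoint hostname is a made-up example.
    $CONFIG->dbuser = 'elgg';
    $CONFIG->dbpass = 'secret';
    $CONFIG->dbname = 'elgg';
    $CONFIG->dbhost = 'elgg-prod.abcdefghij.us-east-1.rds.amazonaws.com';
    ```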

    I have one new plugin - amazons3 - which provides the basic library and overrides the user avatar upload and crop, as well as deleting avatar images on user delete. In addition, it overrides the group edit so that group icons are stored on S3. It also overrides the icon URL handlers to return links to the S3 files. I decided to re-write the file plugin using the 1.8.5 version because the ElggFilePlugin class was hard-wired to store locally. But basically it is a drop-in replacement for the standard file plugin.

    @Steve - The early scheme for permissions is to upload all icon files as public-read. All other files are uploaded as private. This requires the download link in the title button to be an authenticated URL. I am currently authenticating the URL for 5 minutes. Otherwise, standard Elgg permissions apply. The only oddity is that after 5 minutes the URL is no longer valid. I'm thinking a JavaScript addition will prompt users to refresh the page after the URL expires.
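
    As a sketch, the 5-minute authenticated URL can be built with S3's query-string signing (the legacy signature v2 scheme current at the time; newer regions require v4). Names below are illustrative:

    ```php
    <?php
    // Build a time-limited S3 GET URL (legacy query-string signing).
    function amazons3_signed_url($bucket, $key, $accessKey, $secretKey, $ttl = 300) {
        $expires = time() + $ttl;   // 300 s = the 5-minute window described above
        $stringToSign = "GET\n\n\n{$expires}\n/{$bucket}/{$key}";
        $signature = urlencode(base64_encode(
            hash_hmac('sha1', $stringToSign, $secretKey, true)
        ));
        return "https://{$bucket}.s3.amazonaws.com/{$key}"
             . "?AWSAccessKeyId={$accessKey}&Expires={$expires}&Signature={$signature}";
    }
    ```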

    So far it seems to work well - with nothing noticeably different. Upgrading an existing site is beyond the scope of my project. It needs more testing, of course. I know little of simple cache, or caching in general, and would love to get feedback and thoughts on leaving it as a local file store duplicated on each instance.

    I hope to get permission to make this work available to the Elgg community if there is interest.

  • @Jimmy:

    I think| believe| *know you're onto something (good) here..

    Sounds like a ('mySql federating') style for scaling the Elgg Data portions! An aspect that seems to have been 'talked about' (re: Amazon and a few other current technologies..) before.. but not quite 'talked of' @ details level ! Let's call it 'CDN-Elgg'?

    'Simple Cache' - per se might be better left alone for some while.. there are other, more mature (Apache|PHP) caching mechanisms abounding (eg. APC| et al) that could| might supersede Elgg's own 'simple cache', maybe at some later stage(s).

    The '.. re-write file plugin using 1.8.5 version because ElggFilePlugin.. ' sounds like an alright preferred direction for your Design| Implementation.. probably the only way-to-go in this case. Could call it a 'fork' of sorts.. ;-oO

    Some thoughts -- (a) the Elgg Cache moving into Amazon might be an idea worth pursuing for the implications and advantages thereof. (b) Looking into an Amazon S3| RDS powered data storage for Elgg (including existing installs) could be worth the efforts.

    If you wish for some extra (academic &|or code review| devel) assists - just gimme a holler! This has certainly got my interest now. My fave areas have been Data Strucs| Protocols| Language Strucs| Grammar Parsing..;oO This area you're venturing into - and the technology tangents and enhancements it is leading to - will help propel Elgg into newer (strato)spheres.. eventually.

    I will look forward to, and anticipate w/ bated but expectant patience, the code that you plan to publish here (or @GitHub).. perhaps for further collaborations| enhancements| enjoyment| implementations..


    Cheers;-oO

  • Jimmy, we've done Amazon integrations for Elgg before. I fear you may have a serious problem, or even a few.

    1. S3 was designed for backups. We tried using it for file storage, and it was not reliable enough (both for upload and download; I don't remember which one was the killer). If I remember correctly, S3 doesn't guarantee your file was saved correctly, and checking this for every file upload is a problem. You should rather rely on another storage method (we set up a separate instance for this). Maybe the reliability of S3 has changed, although I have strong suspicions it hasn't (we last tested it in January).

    2. Just moving to Amazon is not enough to create a highly scalable solution. If you want a website with moderate load, Amazon can increase the number of users you can support. Requests per second are not high enough with pure Elgg to create a really scalable solution, though; I think you can support approx. 8 req/sec on a small instance. This of course depends on the number of plugins, configuration, etc. You need good caching at the framework level; APC and memcache are not enough.

    RDS is in fact a great solution for Elgg; I strongly advise you to use it. It allows you to forget about setting up a DB cluster. It works great. It's also super easy to scale up. If you want true scalability, though, you have to make modifications to the Elgg database as well. E.g. metadata queries or pagination queries become a killer at some point.

    It's a good idea to create one micro instance which is not used and then, when you create a new instance, just copy it to the new one.

    It's good to use NGINX instead of Apache.

    Memcache behaves weirdly when you use more than one instance - watch out for it :)

    We created a few high-load solutions based on Elgg which were set up on Amazon. Creating a well-scaling solution takes a lot of time, though, and requires rebuilding Elgg's database and view system, and implementing caching. We got results of 200 req per second on a small Amazon instance, and plan to get a lot more (I suspect 500 is possible). It took, however, approx. 1200 hours of development to create a properly stable solution - and it wasn't easy to develop.

    We're just preparing materials about the scalability of Elgg for the Elgg London meetup (on 31st May). In those materials we will describe the technology we used as well. In case you're interested, I will share a link to this data with you.

  • @Dhrup - I'd be grateful for any / all input and review. I will bring up the idea of publishing this at our next meeting.

    @Mike - thanks for your input. I have not heard that S3 is unreliable. They do offer two levels of service, and their TOS leads me to believe that the lower-priced tier may in fact be less reliable. I will bring this up at our next tech / biz meeting. Do you have additional information on this? If it is not appropriate and not 100% reliable, then we need to find an alternative now... We're currently using Apache, though nginx has been discussed. I'll broach it with our server guy.

    Here is a link to a diagram of our project: http://elgg-project.s3.amazonaws.com/project-diagram.pdf

    Thanks for the comments and input. Please - keep it coming.

  • Jimmy, tomorrow I will ask a developer working on S3 for a more detailed comment. I think we checked both options.

    Btw, you will need Multi-AZ for RDS (twice as expensive). Otherwise, you will experience the database going down approx. every 3 weeks. Handling this without Multi-AZ is a lot of work (I think it can be done with a script, but you're still risking not seeing notice of a DB outage in time).

    I'm not sure you need a secondary backup database. Remember you will probably need a dev server as well, though - it should also run on Amazon.

  • @Mike - I just got off the phone with Amazon Web Services. They assure me that S3 is suitable for business-critical file storage. They guarantee 99.9% availability and back that with credits if it ever falls short of that mark. However, I appreciate any additional information and experience data you have. After all, I was talking to an Amazon representative...

    I am using Multi-AZ for my RDS. Even at twice the price it seems quite reasonable. I love the ability to take snapshots. And the ability to replicate the data store in separate regions reduces latency substantially.

    I look forward to hearing more about your S3 experience. File storage is our primary feature. It has to be extremely reliable and highly available.

  • Jimmy, I talked with our developer. We had problems with S3 since our goal was to integrate it with FUSE. You should probably be OK if you use it without FUSE; you just have to check every time whether the file was uploaded successfully.
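
    A minimal sketch of that per-upload check, assuming the AWS SDK for PHP: for a plain single-part PUT, the ETag S3 returns is the MD5 of the body, so it can be compared against the local file (multipart uploads break this assumption; sending a Content-MD5 header for server-side verification is an alternative):

    ```php
    <?php
    $result = $s3->putObject(array(
        'Bucket'     => $bucket,
        'Key'        => $key,
        'SourceFile' => $path,
    ));

    // Single-part, non-encrypted PUT: the ETag is the MD5 of the uploaded bytes.
    $etag = trim($result['ETag'], '"');
    if ($etag !== md5_file($path)) {
        throw new RuntimeException("S3 upload verification failed for {$key}");
    }
    ```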