Updated Aug 17, 2009 by mi...@sorl.net
Labels: Phase-Design
StorageBackend  

S3 thoughts

The problem is, how is this supposed to work? I will outline a scenario:

  1. Client browser requests url
  2. URL -> view -> template; when the template is rendered, each thumbnail's timestamp is checked.

So we need a timestamp: the Django server has to connect to S3 and ask for the timestamp of a file. This results in the following communication:

Client browser -> Django server -> S3 ... -> Client browser. To me it does not look efficient.

What do you think?


Comment by kyle.fox, Dec 22, 2008

Yes, I'm not sure there's a good way of modifying the template tag to use S3. Perhaps someone more clever than I will be able to solve that problem.

Where sorl + S3 would be extremely useful to me is in the actual DjangoThumbnail class itself, using the Python API. For example, it would be really cool if you could pass DjangoThumbnail an optional storage object.

I think I'm using sorl a bit differently than most. Instead of relying on template tags I added properties to my Photo model, like so:

    @property
    def medium(self):
        # Thumbnail of the uploaded image at a requested size of 1050x700
        return DjangoThumbnail(self.image.name, (1050, 700))

With this method I have a known set of sizes, which I create at upload.

Sure, it wouldn't support the dynamic generation in templates feature -- but being able to have it generate a set of known sizes (like flickr) via the Python API and have them store on S3 would be incredibly useful.

Comment by antonio.mele, Dec 23, 2008

I created django-thumbs and it supports storage backends (tested with the S3Storage backend from django-storages), but it's much simpler than sorl-thumbnail.

Comment by elsdoerfer, Mar 10, 2009

I'm not sure what the current status here is (there's an SVN branch, but apparently no work done yet), but since I need this now (not so much for S3, but for #58) I had a look.

One problem seems to be that the current Django Storage implementation does not support anything like retrieving a timestamp. Since this is supposed to work generically with all storage engines, implementing it in a custom S3 backend is not enough; Django's FileSystemStorage would need to support it as well, right? This looks like a roadblock to me right now.

Any work I'll be doing can be found here: http://github.com/miracle2k/sorl-thumbnail/tree/my-storage-refactor

Comment by Liubomir.Petrov, Mar 31, 2009

Even if there is a timestamp for files, checking it will need an additional request to S3. Maybe the better way is to cache which thumbnails are available/unavailable (generated/not generated) in the DB (as a "persistent cache").
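
A minimal sketch of that idea, not code from sorl-thumbnail: a small table records which thumbnails have been generated, so a page render can ask the database instead of S3. The table name and the helper functions are hypothetical, and the standard-library sqlite3 module stands in for what would probably be a Django model in a real project.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open (and create if needed) a table listing generated thumbnails.

    Hypothetical helper: table and column names are illustrative only.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS thumbnail_registry (name TEXT PRIMARY KEY)"
    )
    return conn


def mark_generated(conn, name):
    """Record that the thumbnail `name` has been generated and uploaded."""
    conn.execute(
        "INSERT OR IGNORE INTO thumbnail_registry (name) VALUES (?)", (name,)
    )
    conn.commit()


def is_generated(conn, name):
    """Answer "has this thumbnail been generated?" without contacting S3."""
    row = conn.execute(
        "SELECT 1 FROM thumbnail_registry WHERE name = ?", (name,)
    ).fetchone()
    return row is not None
```

This only answers generated/not generated; it does not notice a source image that changed after the thumbnail was recorded.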

Anyone heading in that direction ?

Comment by skoczen, Jun 10, 2009

So I've got thumbnail and source saving to S3 working reasonably well. Code's here:

http://www.djangosnippets.org/snippets/1562/

General notes:

  • Set MEDIA_URL (or whatever you use for uploaded content) to point to S3, e.g. MEDIA_URL = "http://s3.amazonaws.com/MyBucket/".
  • Put django-storage in project_root/libraries, or change the paths to make you happy.
  • This uses the functionality of django-storage, but not as DEFAULT_FILE_STORAGE.

The functionality works like so:

Getting stuff to S3:

  • On file upload of a noted model, a copy of the uploaded file is saved to S3.
  • On any thumbnail generation, a copy is also saved to S3.

On a page load:

  1. We check to see if the thumbnail exists locally. If so, we assume it's been sent to S3 and move on.
  2. If it's missing, we check to see if S3 has a copy. If so, we download it and move on.
  3. If the thumb is missing, we check to see if the source image exists. If so, we make a new thumb (which uploads itself to S3)
  4. If the source is also missing, we see if it's on S3, and if so, get it, thumb it, and push the thumb back up.
  5. If all of that fails, somebody deleted the image, or things have gone fubar'd.
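
The five steps above can be sketched roughly as follows. This is not the code from the linked snippet: `remote` is a plain dict of name -> bytes standing in for the S3 bucket, and `make_thumb` is a hypothetical callable that writes a thumbnail from a source file.

```python
import os
import tempfile


def _write(path, data):
    with open(path, "wb") as fh:
        fh.write(data)


def ensure_thumbnail(root, thumb_name, source_name, remote, make_thumb):
    """Make sure the thumbnail exists locally; return how it was obtained.

    Assumptions: `remote` is a dict-like stand-in for S3 keyed by file name,
    and `make_thumb(source_path, thumb_path)` writes the thumbnail file.
    """
    thumb_path = os.path.join(root, thumb_name)
    source_path = os.path.join(root, source_name)

    # Step 1: a local copy (even a zero-byte one) means it is already on S3.
    if os.path.exists(thumb_path):
        return "local"
    # Step 2: thumbnail missing locally but present remotely: download it.
    if thumb_name in remote:
        _write(thumb_path, remote[thumb_name])
        return "downloaded"
    # Step 3: the source is available locally: build a new thumbnail from it.
    if os.path.exists(source_path):
        result = "rebuilt"
    # Step 4: the source is only available remotely: fetch it first.
    elif source_name in remote:
        _write(source_path, remote[source_name])
        result = "rebuilt-from-remote-source"
    # Step 5: nothing left to try.
    else:
        raise FileNotFoundError(source_name)

    make_thumb(source_path, thumb_path)
    with open(thumb_path, "rb") as fh:
        remote[thumb_name] = fh.read()  # push the new thumbnail back up
    return result
```

Only the first call for a given thumbnail touches the remote store; later calls return at step 1 after a single local existence check.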

Advantages:

  • Thumbs are checked locally, so everything after the initial creation is very fast.
  • You can clear out local files to save disk space on the server (one assumes you needed S3 for a reason), and trust that only the thumbs should ever be downloaded.
  • If you want to be really clever, you can delete the original source files, and zero-byte the thumbs. This means very little space cost, and everything still works.
  • If you're not actually low on disk space, Sorl Thumbnail keeps working just like it did, except your content is served by S3.

Problems:

  • My python-fu is not as strong as those who wrote Sorl Thumbnail. I did tweak their code. Something may be wonky. YMMV.
  • The relative_source property is a hack, and if the first 7 characters of the filename are repeated somewhere, step 4 above will fail.
  • Upload is slow, and the first thumbnailing is slow, because we wait for the transfers to S3 to complete. This isn't django-storage, so things do genuinely take longer.

Comment by arockinit, Jul 28, 2009

Eek. Use django-storages for the backend rather than rolling your own. I've got django-imagekit working with django-storages and all I had to do was follow:

http://code.welldev.org/django-storages/wiki/S3Storage and then http://bitbucket.org/jdriscoll/django-imagekit/wiki/Home

With virtually no modification (except dropping cache_dir = 'photos' from the imagekit example). It looks like SORL may even work better than imagekit but there's no point in reinventing the s3 wheel when django-storages exists already.

Caching is where everybody should be focused IMHO.

Comment by skoczen, Sep 18, 2009

@arockinit: Take a look at the code again. It quite explicitly uses django-storage for the back end; nothing custom has been rolled. And it provides caching by using the filesystem, instead of inventing a more complex DB layer.

I'm not saying it's perfect, but it's worked very well in practice, has been responsive, and didn't require reinventing anything. This broader approach (use django-storage to provide the S3 layer, use the filesystem (even if simply for zero-byte files) for caching) seems elegant to me, and might merit further consideration for the project.

Comment by Megaman821, Sep 23, 2009

Ignoring even the S3 issue, it would be nice for Sorl to use as much of the Django storage api as possible.

Then if someone wanted to use S3 they would only have to attack the timestamp issue.

Comment by kylemacfarlane, Oct 04, 2009

I want to give this a go today. Skoczen's basic logic is fine but requires custom save() methods on every model with files since it doesn't use DEFAULT_FILE_STORAGE.

Looking at the code, it's not a case of whether it'd be nice or not for sorl to use the Django storage API; it HAS to use it if we want stuff like django-storages to be usable. (This is 99% of the work right here.)

So I'm looking around for all the file system activity and it all seems to be in base.py, right? How the files are fed to PIL needs to be changed too.

Once I've got it all using the Django storage API, I think I will then add an option to store the timestamps in the database. By default it will be off, which will be slow as hell with things like S3 and CloudFiles, but with it on it should be fine.

Comment by Megaman821, Oct 08, 2009

I find it easier to break this down into two steps: making it work, and then making it fast.

As for making it work, I think it would be best to petition or fork django-storages and add a last_modified function to the storages. The S3 case is absolutely trivial since the last-modified time is passed in the HTTP headers. So sorl would just have to check if there is a last_modified method and, if not, assume the file is on a normal file system and figure out the modified time just as it does now.
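
A rough sketch of that check, assuming the proposed method is called last_modified(name) and returns a POSIX timestamp (both are assumptions; the method is only a proposal in this thread). The fallback relies on the storage exposing path(name), as Django's FileSystemStorage does.

```python
import os


def get_modified_time(storage, name):
    """Return the modification time of `name` as a POSIX timestamp."""
    # Assumption: remote storages would expose the proposed last_modified().
    last_modified = getattr(storage, "last_modified", None)
    if callable(last_modified):
        return last_modified(name)
    # Otherwise assume a local file system storage that provides path().
    return os.path.getmtime(storage.path(name))
```

Using getattr keeps sorl working with storages that never gain the new method, at the cost of assuming those storages are local.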

As for making it fast, why not just use Django's cache decorator and vary the cache key based on the requested file path? If you apply the cache decorator to the functions last_modified and exists, sorl should run pretty fast. Then users can control whether the file metadata is stored in a file, the DB, or memcached through how they set up caching. It is probably not a good idea to make that choice for them.
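
One way this could look, as a sketch rather than an existing Django or sorl API: a small decorator that stores results under a key built from the file path. `DictCache` is a stand-in that mimics the get/set calls of Django's low-level cache so the example runs without a configured project; `cached_by_path` is a hypothetical name.

```python
import functools


class DictCache:
    """In-memory stand-in exposing get/set like Django's low-level cache."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value, timeout=None):
        self._data[key] = value  # timeout is ignored in this stand-in


_MISSING = object()  # lets a cached False/None be told apart from a miss


def cached_by_path(cache, prefix):
    """Cache a one-argument function of a file path under '<prefix>:<path>'."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(path):
            # Real backends may restrict key length and characters, so a
            # production version would likely hash the path first.
            key = "%s:%s" % (prefix, path)
            value = cache.get(key, _MISSING)
            if value is _MISSING:
                value = func(path)
                cache.set(key, value)
            return value
        return wrapper
    return decorator
```

Because the cache object is passed in, swapping the stand-in for a file, database, or memcached backend would not change the decorated functions.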

Comment by kylemacfarlane, Oct 10, 2009

1) I agree, but should it be getmtime() instead of last_modified() for consistency?

2) I don't think a decorator should be used. If someone uses a decorator and then also uses memcache for their cache and then reboots, every thumbnail would be rechecked. I think it might be better to just let the app accessing the data deal with the caching itself and then it could use a mixture of DB and cache (memcache) for speed and persistence.
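
A sketch of such a mixture, with plain dicts standing in for memcache (`fast`) and a database table (`persistent`), and `fetch` as a hypothetical function that asks the remote storage:

```python
def lookup_mtime(name, fast, persistent, fetch):
    """Return the mtime for `name`, asking the slowest source last.

    Assumptions: `fast` and `persistent` are dict-like stand-ins for a
    memory cache and a database; `fetch(name)` queries the remote storage.
    """
    if name in fast:
        return fast[name]
    if name in persistent:
        value = persistent[name]
    else:
        value = fetch(name)       # slow round trip, e.g. to S3
        persistent[name] = value  # survives a cache restart
    fast[name] = value            # repopulate the fast tier
    return value
```

After a reboot empties the fast tier, lookups fall back to the persistent tier instead of rechecking every thumbnail against the remote storage.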

Comment by Megaman821, Oct 10, 2009

1) The name isn't so important; it should just be consistent and descriptive.

2) I am pretty sure that with a little plumbing you can use multiple cache backends. Perhaps it is better for sorl to default to the file cache backend, and a user could change it to db or memcached through a SORL_CACHE_BACKEND setting. I think as we add flexibility in one area, file backends, that we should take it away in another, caching.

Comment by Megaman821, Oct 18, 2009

I decided to kick things off here: http://code.google.com/r/jasonchrista-backends/

The first stage is to get sorl-thumbnail working solely with the FileStorage. Currently this is mostly working. There are still some places where the storage should be used and a lot of tests need updating.

The second stage is to get sorl-thumbnail working with alternate storages. The biggest issue I am currently having is what file PIL wants to read from and write to.

The third stage is to speed up alternative storages.

