Scenario
Users can post an item and include up to 5 images with the post. Each uploaded image needs to be resampled and resized, which creates 4 extra images, i.e. 5 files per upload (the original plus 4 derived versions). So if a user uploads 5 images, we end up with 25 images total to store.
Assumptions
- The images have been properly checked and they're valid image files
- The system has to scale (let's assume 1000 posts in the first instance, so maximum 5000 images)
- Each image is renamed in relation to the auto_increment id of the DB post entry and includes a relevant suffix, e.g. 12345_1_1.jpg, 12345_2_1.jpg - so there are no issues with duplicates (see the sketch after this list)
- The images aren't of a sensitive nature, so there's no issue with having them directly accessible (although directory listing would be disabled)
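Concretely, a quick sketch of that naming scheme (my reading of the suffixes: {post_id}_{image_no}_{version_no}.jpg, with version 1 being the original - the exact meaning of the two numbers is a guess):

    def image_filenames(post_id, image_count, versions=5):
        """Names like 12345_1_1.jpg: {post_id}_{image_no}_{version_no}.jpg.
        Assumes version 1 is the original and 2-5 are the resized copies."""
        return ["%d_%d_%d.jpg" % (post_id, i, v)
                for i in range(1, image_count + 1)
                for v in range(1, versions + 1)]

    # A post with id 12345 and 5 uploaded images -> 25 unique file names
    print(len(image_filenames(12345, 5)))   # 25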
Possible approaches
- Given the ids are unique we could just drop them all into one folder (inefficient after a certain point).
- Could create a folder for each post and place all of its images into that, so ROOT/images/12345 (again, we'd end up with a multitude of folders)
- Could do an image store based on date, i.e. each day a new folder is created and that day's images are stored in there.
- Could store the images based on the resized type, i.e. all the original files could be stored in one folder images/orig and all the thumbnails in images/thumb (I think Gumtree uses an approach like this).
- Could allow X number of files to be stored in one folder before creating another one (this and the date-based layout are sketched after this list).
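For reference, a rough sketch of the date-based and capped-folder layouts (the date format and the 1,000-per-folder cap are placeholder choices of mine):

    import datetime
    import os

    def date_path(root, filename):
        # Date-based layout: a new folder per day, e.g. images/2011/04/27
        return os.path.join(root,
                            datetime.date.today().strftime("%Y/%m/%d"),
                            filename)

    def bucket_path(root, file_id, filename, per_folder=1000):
        # Capped layout: derive the folder from the numeric id so at most
        # `per_folder` ids land in each directory, e.g. images/12/...
        return os.path.join(root, str(file_id // per_folder), filename)

    print(date_path("images", "12345_1_1.jpg"))
    print(bucket_path("images", 12345, "12345_1_1.jpg"))  # images/12/12345_1_1.jpg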
Anyone have experience on the best practices / approaches when it comes to storing images scalably?
Note: I expect someone will mention S3 - let's assume we want to keep the images locally for the time being.
Thanks for looking
We have such a system in heavy production with 30,000+ files and 20+ GB to date...
   Column    |            Type             |                        Modifiers
-------------+-----------------------------+-----------------------------------------------------------
 File_ID     | integer                     | not null default nextval('"ACRM"."File_pseq"'::regclass)
 CreateDate  | timestamp(6) with time zone | not null default now()
 FileName    | character varying(255)      | not null default NULL::character varying
 ContentType | character varying(128)      | not null default NULL::character varying
 Size        | integer                     | not null
 Hash        | character varying(40)       | not null
Indexes:
    "File_pkey" PRIMARY KEY, btree ("File_ID")
The files are just stored in a single directory with the integer File_ID as the name of the file. We're over 30,000 with no problems. I've tested higher with no problems.
This is using RHEL 5 x86_64 with ext3 as the file system.
Would I do it this way again? No. Let me share a couple thoughts on a redesign.
The database is still the "master source" of information on the files.
Each file is sha1() hashed and stored in a filesystem hierarchy based on that hash:
/FileData/ab/cd/abcd4548293827394723984723432987.jpg
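For illustration, a minimal sketch of that layout (two levels of two hex characters, matching the example path; the fan-out depth is otherwise arbitrary):

    import hashlib
    import os

    def blob_path(root, data, extension):
        # Place the file under its sha1, e.g. /FileData/ab/cd/abcd....jpg
        digest = hashlib.sha1(data).hexdigest()
        return os.path.join(root, digest[:2], digest[2:4], digest + extension)

    print(blob_path("/FileData", b"...image bytes...", ".jpg"))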
The database would be a bit smarter about storing meta-information on each file. It would be a three-table system:
File : stores info such as name, date, ip, owner, and a pointer to a Blob (sha1)
File_Meta : stores key/value pairs on the file, depending on the type of file. This may include information such as Image_Width, etc...
Blob : stores a reference to the sha1 along with its size.
This system would de-duplicate the file content by storing the data referenced by a hash (multiple files could reference the same file data). It would also be very easy to back up and sync the file store using rsync.
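A sketch of how the de-duplicating store could work, assuming the sharded layout above (the helper name and copy semantics are my own; the File/Blob rows are the tables described above):

    import hashlib
    import os
    import shutil

    def store_file(root, source_path):
        # Hash the content; identical content always maps to the same target,
        # so a second upload of the same bytes is stored only once.
        with open(source_path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        target_dir = os.path.join(root, digest[:2], digest[2:4])
        target = os.path.join(target_dir,
                              digest + os.path.splitext(source_path)[1])
        if not os.path.exists(target):        # the de-duplication step
            os.makedirs(target_dir, exist_ok=True)
            shutil.copyfile(source_path, target)
        # The caller would then insert a File row pointing at `digest`,
        # and a Blob row (digest, size) if one doesn't exist yet.
        return digest, target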
Also, the limitations of a given directory containing a lot of files would be eliminated.
The file extension would be stored as part of the unique file name. For example, if the hash of an empty file were abcd8765..., then an empty .txt file and an empty .php file would share the same hash; rather than collide on one name, they should be stored as abcd8765.txt and abcd8765.php. Why?
Apache, etc. can be configured to automatically choose the content type and caching rules based on the file extension. It is important to store the files with a valid name and an extension which reflects the content of the file.
You see, this system could really boost performance by delegating file delivery to nginx. See http://wiki.nginx.org/XSendfile.
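To make that concrete, a minimal sketch assuming nginx in front of a WSGI app; the /FileData location and alias path are placeholders, and note that nginx's flavour of X-Sendfile is the X-Accel-Redirect header:

    # nginx side (configuration, shown here as a comment):
    #
    #     location /FileData/ {
    #         internal;                 # not reachable directly by clients
    #         alias /var/www/FileData/;
    #     }
    #
    # The app checks the database/permissions, then hands delivery to nginx:

    def app(environ, start_response):
        start_response("200 OK", [
            ("Content-Type", "image/jpeg"),
            ("X-Accel-Redirect",
             "/FileData/ab/cd/abcd4548293827394723984723432987.jpg"),
        ])
        return [b""]   # nginx streams the file itself; the app sends no body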
I hope this helps in some way. Take care.
I would store all the images in a single folder - the database then keeps track of file names - keep it simple
First off, I would recommend creating a table for the images. This is a one-row-per-image-file table:
| id  | filename | type     | storage |
----------------------------------------
| 123 | 123.png  | original | store1  |
id is an auto-incrementing int or something equally unique.
filename is the file's base name. This allows you to move the file and just update the code. The filename could be {file_id}.{extension}.
type is the type of image: original, thumbnail, resized, whatever. Could also be the dimensions: 100x100, 500x, x500 (where 500x would be unlimited height and x500 would be unlimited width). These are just some examples.
storage would be an identifier for where the file is; this could be a directory. Say that you store your images in post_images, the filename is 123.png and the storage is store1 - then the path would be post_images/store1/123.png (sketched below).
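A tiny sketch of that mapping (assuming the row comes back from the table above as a dict):

    import os

    def image_path(root, row):
        # row mirrors the table above: id, filename, type, storage
        return os.path.join(root, row["storage"], row["filename"])

    print(image_path("post_images",
                     {"id": 123, "filename": "123.png",
                      "type": "original", "storage": "store1"}))
    # -> post_images/store1/123.png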
I have yet to try this myself, but I have had problems with web apps storing 10k+ files in the same directory.