What does a PUT to a Swift object server look like?

I have been trying lately to get a better understanding of the Swift code base, and I found the best way to learn it was to read it from top to bottom and document it along the way. Here are some of my notes; hopefully more will come.

I am starting with an object PUT, when the request is coming from the proxy server. The request in the log file will look like this:

"PUT /sdb1/2/AUTH_dcbeb7f1271d4374b951954a4f1be15f/foo/file.txt" 201 - "-" "txdw08eca2842e344bb8e11b5869c81cb52" "-" 0.0308

The WSGI controller sends the request to the method swift.obj.server.ObjectController->PUT, which does the following:

  • Splits the request.path into:

device(sdb1), partition(2), account(AUTH_ACCOUNT_ID), container(foo), obj(file.txt)
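
As a rough illustration of that split, here is a minimal sketch. The function name split_object_path is hypothetical and this is a stand-alone stand-in, not Swift's own helper; the only assumption is that the path always has the five parts listed above and that the object name may itself contain slashes.

```python
def split_object_path(path):
    """Split '/device/partition/account/container/obj' into its five parts.

    Hypothetical helper for illustration only. The split is limited to four
    separators so that any '/' inside the object name is preserved.
    """
    if not path.startswith("/"):
        raise ValueError("invalid object path: %r" % path)
    parts = path[1:].split("/", 4)
    if len(parts) != 5 or not all(parts):
        raise ValueError("invalid object path: %r" % path)
    device, partition, account, container, obj = parts
    return device, partition, account, container, obj


print(split_object_path(
    "/sdb1/2/AUTH_dcbeb7f1271d4374b951954a4f1be15f/foo/file.txt"))
```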

  • Makes sure the device is mounted (the mount_check option can toggle this check).
  • Ensures there is an X-Timestamp header, which should have been set by the proxy server.
  • Runs the check method check_object_creation, which does the following:
      • Makes sure the Content-Length is not greater than MAX_FILE_SIZE.
      • Makes sure there is a Content-Length header (unless the transfer is chunked).
      • Makes sure the body is empty (a Content-Length of zero) when doing an X-Copy-From.
      • Makes sure the object name is not longer than MAX_OBJECT_NAME_LENGTH (1024 bytes by default).
      • Makes sure there is a Content-Type in the headers (it can be set by the user or guessed via mimetypes.guess_type on the proxy server).
      • When there is an x-object-manifest header (for large-file support), makes sure the value has the container/object form and that the referenced object names do not contain characters like ? & /.
      • Checks the metadata:
          • The metadata names are not empty.
          • The metadata names are not longer than MAX_META_NAME_LENGTH (default: 128).
          • The metadata values are not longer than MAX_META_VALUE_LENGTH (default: 256).
          • There are no more metadata entries than MAX_META_COUNT (default: 90).
          • The combined size of the metadata (names + values) is not over MAX_META_OVERALL_SIZE (default: 4096).
  • If there is an X-Delete-At header (for the object expiration feature), makes sure it is not in the past; otherwise the request exits with an HTTPBadRequest.
  • The class swift.obj.server.DiskFile takes care of actually writing the file locally. It gets instantiated and does the following in its constructor:
  • It hashes the values (account, container, obj), which in our example gives:

46acec4563797178df9ec79b28146fe1

  • It works out the path where the object is going to be stored, which will be:

/srv/node/sdb1/objects/2/fe1/46acec4563797178df9ec79b28146fe1
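
The hashing and the path layout can be sketched roughly as follows. This is a minimal sketch under assumptions, not Swift's actual code: the function names are hypothetical, and it assumes the hash is an MD5 of "/account/container/obj" with a cluster-wide secret suffix appended. The suffix value below is a placeholder, so hash_name will not reproduce the digest from the example.

```python
import hashlib
import posixpath

# Assumption: a per-cluster secret suffix is mixed into the hash.
# "changeme" is a placeholder, not a real value.
HASH_PATH_SUFFIX = "changeme"


def hash_name(account, container, obj, suffix=HASH_PATH_SUFFIX):
    """Hex MD5 of '/account/container/obj' plus the secret suffix."""
    name = "/" + "/".join((account, container, obj))
    return hashlib.md5((name + suffix).encode("utf-8")).hexdigest()


def object_dir(devices, device, partition, name_hash, datadir="objects"):
    """Build devices/device/datadir/partition/<last 3 hex chars>/<hash>."""
    return posixpath.join(devices, device, datadir, str(partition),
                          name_hash[-3:], name_hash)


# Using the hash from the example directly, to show the layout.
print(object_dir("/srv/node", "sdb1", 2, "46acec4563797178df9ec79b28146fe1"))
```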

  • /srv/node is the devices path, set by the devices configuration directive of the object server (default: /srv/node).
  • sdb1 is the mounted device name.
  • objects is the datadir type for us.
  • 2 is the partition number.
  • fe1 is the last three characters of the hashed name.
  • 46acec4563797178df9ec79b28146fe1 is the hash itself.
  • It works out the temporary directory, which in our case is /srv/node/sdb1/tmp: basically the devices dir, the device and tmp.
  • If the object directory did not exist before, it just returns.
  • If the directory already existed (the object was already uploaded), it goes through all the files in there and looks for:
  • Files ending with .ts, which are tombstones (a deleted file). NB: the replication process will take care of calling os.unlink() on the file properly later.
  • In the case of a POST, and if the fast POST setting is enabled (see the object_post_as_copy option in proxy_server), it detects it and only copies the metadata.
  • It calculates the upload expiration time, which is now + the max_upload_time setting.
  • It starts the ETag hashing, to calculate the md5 of the object incrementally.
  • Using the mkstemp method of DiskFile, it starts writing to the tmpdir. The file is created like this:
      • Makes sure the tmpdir exists, creating it if needed.
      • Makes a secure temporary file (using mkstemp(3)) and yields the file descriptor back to PUT.
      • If there is a Content-Length in the headers (set by the client), it uses the fallocate(2) system call to pre-allocate that disk space for the file descriptor.
  • It then iterates over chunks of data, of the size defined by the configuration variable network_chunk_size (default: 65536 bytes, i.e. 64 KiB), reading each chunk from the request wsgi.input:
      • It updates the upload_size value.
      • It makes sure we are not going over our upload expiration time (otherwise it returns an HTTPRequestTimeout error).
      • It updates the calculated md5 with that chunk.
      • It writes the chunk using Python's os.write.
      • For large files, whenever more than the configuration variable bytes_per_sync has been written, it does an fdatasync(2) and drops the kernel buffer caches (so we are not filling up too much kernel memory).
  • If there is a Content-Length in the client headers that does not match the calculated upload_size, it returns a 499 Client Disconnected, as it means there was a problem somewhere during the upload.
  • It bails out if there is an ETag in the client headers that does not match the calculated etag.
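
To make the write loop more concrete, here is a minimal sketch of the same idea in plain Python. It is not the DiskFile code: write_object is a hypothetical name, the default values for bytes_per_sync and max_upload_time are illustrative, os.fsync stands in for fdatasync(2) for portability, and dropping the kernel caches is left out.

```python
import hashlib
import io
import os
import tempfile
import time


def write_object(body, tmpdir, content_length=None, chunk_size=65536,
                 bytes_per_sync=64 * 1024 * 1024, max_upload_time=86400):
    """Stream `body` into a temp file and return (path, size, etag).

    Hypothetical sketch; bytes_per_sync and max_upload_time defaults are
    illustrative values only.
    """
    os.makedirs(tmpdir, exist_ok=True)
    deadline = time.time() + max_upload_time
    fd, tmp_path = tempfile.mkstemp(dir=tmpdir)
    etag = hashlib.md5()
    upload_size = 0
    last_sync = 0
    try:
        if content_length and hasattr(os, "posix_fallocate"):
            try:
                # Pre-allocation is only an optimisation; ignore failures.
                os.posix_fallocate(fd, 0, content_length)
            except OSError:
                pass
        while True:
            chunk = body.read(chunk_size)
            if not chunk:
                break
            if time.time() > deadline:
                raise TimeoutError("upload took longer than max_upload_time")
            etag.update(chunk)
            while chunk:
                # os.write may write fewer bytes than requested.
                written = os.write(fd, chunk)
                upload_size += written
                chunk = chunk[written:]
            if upload_size - last_sync >= bytes_per_sync:
                os.fsync(fd)  # flush periodically on large uploads
                last_sync = upload_size
        if content_length is not None and upload_size != content_length:
            raise IOError("client disconnected: size mismatch")
        os.fsync(fd)
    except BaseException:
        # Never leave a partial temp file behind.
        os.close(fd)
        os.unlink(tmp_path)
        raise
    os.close(fd)
    return tmp_path, upload_size, etag.hexdigest()


data = b"hello swift" * 1000
demo_dir = os.path.join(tempfile.mkdtemp(), "tmp")
path, size, etag = write_object(io.BytesIO(data), demo_dir,
                                content_length=len(data))
print(size, etag == hashlib.md5(data).hexdigest())
```

A caller would then compare the returned etag with the one sent by the client, and rename the temp file into the object directory once the metadata has been written.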

Now we start defining the metadata that is going to be stored with the file:

metadata = {
  'X-Timestamp': timestamp generated by the proxy server,
  'Content-Type': defined by the user or guessed by the proxy server,
  'ETag': value calculated from the request body,
  'Content-Length': an fstat(2) on the file, to get the actual size stored on disk,
}
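
A minimal sketch of how such a dictionary might be assembled, together with the user metadata and the extra headers the object server is allowed to store. The function name build_metadata is hypothetical, the headers are assumed to be a plain dict, and the allowed set simply mirrors the default allowed_headers value quoted in these notes.

```python
# Mirrors the default allowed_headers setting quoted in these notes.
ALLOWED_HEADERS = {
    "content-disposition",
    "content-encoding",
    "x-delete-at",
    "x-object-manifest",
}


def build_metadata(headers, etag, size_on_disk):
    """Return the metadata dict to store alongside the object.

    Hypothetical helper: `headers` is a plain dict of request headers,
    `etag` the md5 computed while writing, `size_on_disk` the fstat size.
    """
    metadata = {
        "X-Timestamp": headers["X-Timestamp"],
        "Content-Type": headers["Content-Type"],
        "ETag": etag,
        "Content-Length": str(size_on_disk),
    }
    for name, value in headers.items():
        lower = name.lower()
        # Keep user metadata and the explicitly allowed headers only.
        if lower.startswith("x-object-meta-") or lower in ALLOWED_HEADERS:
            metadata[name] = value
    return metadata


print(build_metadata(
    {"X-Timestamp": "1360000000.00000", "Content-Type": "text/plain",
     "X-Object-Meta-Color": "blue", "X-Auth-Token": "not-stored"},
    etag="d41d8cd98f00b204e9800998ecf8427e", size_on_disk=0))
```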

  • It adds to the metadata every header starting with 'x-object-meta-'.
  • It adds to the metadata the headers allowed to be stored, as defined by the config variable allowed_headers (default: allowed_headers = Content-Disposition, Content-Encoding, X-Delete-At, X-Object-Manifest).
  • It writes the file using the put method of the DiskFile class, which finalises the write of the file on disk and renames it from the temp file to its real location:
      • It writes the metadata using xattr(1) extended attributes, so it is stored directly with the file.
      • If there is a Content-Length in the metadata, it drops the kernel cache for that length.
      • It invalidates the hashes of the datadir directory using the function swift.obj.replicator.invalidate_hashes:
          • It sets the hash of the directory to None, which hints to the replication process that it has something to do with that directory (and that the hash has to be regenerated).
          • These hashes are stored per partition as a Python pickle, which in our case is /srv/node/sdb1/objects/2/hashes.pkl.
      • It moves the file from the tmp dir to the datadir.
  • It uses the unlinkold method of DiskFile to remove any older versions of the object file, i.e. any files that have an older timestamp.
  • It starts constructing the request to make to the container server, passing the following:
      • account, container, obj as the request path.
      • the original headers.
      • the headers Content-Length, Content-Type, X-Timestamp, ETag, X-Trans-Id.
  • It gets the headers X-Container-{Host,Partition,Device} from the original headers; these are set by the proxy so the object server knows which container server to update. Each of the object PUTs for the replicas is assigned a different container server.
  • It uses the async_update method (on self, since it is part of the same class) to make an asynchronous request:
      • It passes the headers built above and req.path.
      • If the request succeeds (status between 200 and 300), it returns to the main (PUT) method.
      • If the request does not succeed, it creates an async_pending file locally (written through the tmp dir), which is going to be picked up later by the object updater process to update the container listing when the container server is not too busy.
  • When finished, it responds with an HTTPCreated.
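
The "try the container update now, otherwise queue it on disk" behaviour can be sketched like this. It is a minimal sketch and not the async_update code: container_update and the send callable are hypothetical, and the file naming and pickle layout of the pending update are assumptions made for illustration.

```python
import hashlib
import os
import pickle
import tempfile


def container_update(send, op, account, container, obj, headers, async_dir):
    """Try the container update now; queue it on disk if it fails.

    `send` is a hypothetical callable that performs the HTTP request and
    returns the status code. Returns True if the update went through,
    False if it was saved for later.
    """
    try:
        status = send(op, account, container, obj, headers)
        if 200 <= status < 300:
            return True
    except OSError:
        pass  # connection errors are treated like a failed update

    os.makedirs(async_dir, exist_ok=True)
    pending = {"op": op, "account": account, "container": container,
               "obj": obj, "headers": headers}
    # Assumed naming scheme: hash of the object path plus the timestamp.
    name = hashlib.md5(
        ("/%s/%s/%s" % (account, container, obj)).encode("utf-8")).hexdigest()
    target = os.path.join(async_dir, "%s-%s" % (name, headers["X-Timestamp"]))
    # Write to a temp file first, then rename, so readers never see a
    # half-written pending update.
    fd, tmp_path = tempfile.mkstemp(dir=async_dir)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(pending, f)
    os.rename(tmp_path, target)
    return False


demo_dir = os.path.join(tempfile.mkdtemp(), "async_pending")
queued = not container_update(lambda *args: 503, "PUT", "AUTH_x", "foo",
                              "file.txt", {"X-Timestamp": "1360000000.00000"},
                              demo_dir)
print(queued, os.listdir(demo_dir))
```

Either way the PUT itself still succeeds; the container listing just catches up later.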
  • Anonymous

    Thank you for posting this. It’s very helpful in understanding the object server at a lower level. Is the default network_chunk_size 64m or 64k ?

  • Anonymous

    The value needs to be specified in bytes and it is 65536 by default, so 64 KiB.

  • gmm

    Hi,
    Thanks for the great article.
    Can you guide me about how GET request will flow?
    I did not see any os.read function get called in source.
    Please help !!!

  • Manish K Singh

    Hi, thanks for posting this blog. This is very helpful.

    Why do we need to keep the copies of ring (account.ring.gz container.ring.gz object.ring.gz) on storage node? Just for backup purpose or some other things/operations as well?