What does a PUT to a Swift object server look like?

Lately I have been trying to get a better understanding of the Swift code base, and I found that the best way to learn it was to read it from top to bottom and document it along the way. Here are some of my notes; hopefully more will come.

I am starting with an object PUT as it arrives from the proxy server. In the log file the request looks like this:

"PUT /sdb1/2/AUTH_dcbeb7f1271d4374b951954a4f1be15f/foo/file.txt" 201 - "-" "txdw08eca2842e344bb8e11b5869c81cb52" "-" 0.0308

The WSGI controller dispatches the request to the method swift.obj.server.ObjectController->PUT, which does the following:

  • Splits the request.path into:

device(sdb1), partition(2), account(AUTH_ACCOUNT_ID), container(foo), obj(file.txt)
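
To make that split concrete, here is a minimal sketch. Swift ships its own helper for this; split_object_path below is a hypothetical stand-in I wrote for illustration, assuming the path always has the five-part form shown above and that only the object name may contain extra slashes.

```python
def split_object_path(path):
    """Split '/device/partition/account/container/obj' into its five parts.

    The object name may itself contain slashes, so only the first four
    separators after the leading one are significant.
    """
    if not path.startswith('/'):
        raise ValueError('path must start with /: %r' % path)
    parts = path[1:].split('/', 4)
    if len(parts) != 5 or not all(parts):
        raise ValueError('expected 5 non-empty segments in %r' % path)
    device, partition, account, container, obj = parts
    return device, partition, account, container, obj


print(split_object_path(
    '/sdb1/2/AUTH_dcbeb7f1271d4374b951954a4f1be15f/foo/file.txt'))
```

Anything with fewer than five segments raises a ValueError here, which is where a real server would answer with a bad request.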

  • Makes sure that the device (sdb1) is mounted (the mount_check option toggles this check).
  • Ensures that there is an X-Timestamp header, which should have been set by the proxy server.
  • Runs the check method check_object_creation, which does the following:
  • Makes sure the Content-Length is not greater than MAX_FILE_SIZE.
  • Makes sure there is a Content-Length header (unless the transfer is chunked).
  • Makes sure the body is empty (a Content-Length of zero) when an X-Copy-From header is present.
  • Makes sure the object name is not longer than MAX_OBJECT_NAME_LENGTH (1024 bytes by default).
  • Makes sure there is a Content-Type in the headers (set by the user, or guessed via mimetypes.guess_type on the proxy server).
  • If there is an x-object-manifest header (for large file support), makes sure its value has the container/object form and that the referenced object name does not contain characters like ? & /.
  • Checks the metadata, first making sure that no metadata name is empty.
  • No metadata name is longer than MAX_META_NAME_LENGTH (default: 128).
  • No metadata value is longer than MAX_META_VALUE_LENGTH (default: 256).
  • There are no more metadata entries than MAX_META_COUNT (default: 90).
  • The combined size of the metadata (names + values) does not exceed MAX_META_OVERALL_SIZE (default: 4096).
  • If there is an ‘X-Delete-At’ header (for the object expiration feature), makes sure it is not in the past; otherwise the request fails with an HTTPBadRequest.
  • The class swift.obj.server.DiskFile takes care of actually writing the file locally. It gets instantiated and does the following in its constructor:
  • It hashes the (account, container, obj) triple, which for our example becomes:


  • It works out the path where the object is going to be stored, which is going to be:


  • /srv/node is the devices path, set by the devices option of the object server configuration (default: /srv/node).
  • sdb1 is the mounted device name.
  • then comes the datadir type, “objects” for us.
  • then the partition (2).
  • then the last three characters of the hashed name (fe1).
  • and finally the hash itself, 46acec4563797178df9ec79b28146fe1.
  • It works out the temporary directory, which in our case is /srv/node/sdb1/tmp: the devices dir, the device and /tmp.
  • If the object directory did not exist before, the constructor just returns.
  • If the directory already exists (the object was uploaded before), it goes through all the files in there and looks for:
  • Files ending in .ts, which are tombstones (a deleted file). NB: the replication process will take care of os.unlink()ing the file properly later.
  • In the case of a POST with the fast POST setting enabled (see the object_post_as_copy option of the proxy server), it detects this and only copies the metadata.
  • It calculates the upload expiration time, which is now + the max_upload_time setting.
  • It starts the etag hashing, to calculate the MD5 of the object incrementally.
  • Using the mkstemp method of DiskFile, it starts writing to the tmpdir; the file is created like this:
  • It makes sure the tmpdir exists, creating it if needed.
  • It makes a secure temporary file (using mkstemp(3)) and yields the file descriptor back to PUT.
  • If there is a Content-Length in the headers (set by the client), it uses fallocate(2) to pre-allocate that much disk space for the file descriptor.
  • It then iterates over the request body (wsgi.input), reading chunks whose size is set by the network_chunk_size option (default: 65536 bytes, i.e. 64 KiB). For each chunk:
  • It updates the upload_size value.
  • It makes sure we have not gone past the upload expiration time (otherwise the client gets an HTTPRequestTimeout error).
  • It updates the running MD5 with that chunk.
  • It writes the chunk using Python’s os.write.
  • For large files, each time more than bytes_per_sync bytes have been written it does an fdatasync(2) and drops the kernel buffer cache (so we do not fill up too much kernel memory).
  • If the Content-Length in the client headers does not match the calculated upload_size, it returns a 499 Client Disconnected, as it means something went wrong during the upload.
  • It bails out if there is an ETag in the client headers that does not match the calculated etag.
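
The upload loop at the end of that list is easier to follow as code. This is only a sketch of the idea, not Swift's implementation: write_upload and upload_status are names I made up, the default values are assumptions, os.fsync stands in for fdatasync(2) plus the cache drop, and the 422 status for an etag mismatch is my assumption since the notes above only say it bails out.

```python
import hashlib
import io
import os
import tempfile
import time


def write_upload(body, fd, chunk_size=65536, bytes_per_sync=512 * 1024 * 1024,
                 max_upload_time=86400):
    """Copy body (a file-like object) to fd; return (upload_size, md5 hex)."""
    deadline = time.time() + max_upload_time
    etag = hashlib.md5()
    upload_size = 0
    last_sync = 0
    for chunk in iter(lambda: body.read(chunk_size), b''):
        if time.time() > deadline:
            raise TimeoutError('upload exceeded max_upload_time')
        etag.update(chunk)
        while chunk:  # os.write may write less than it was given
            written = os.write(fd, chunk)
            upload_size += written
            chunk = chunk[written:]
        if upload_size - last_sync >= bytes_per_sync:
            os.fsync(fd)  # stand-in for fdatasync(2) + dropping the cache
            last_sync = upload_size
    return upload_size, etag.hexdigest()


def upload_status(headers, upload_size, etag):
    """Map the two final checks to a status code (header names lower-case)."""
    if 'content-length' in headers and \
            int(headers['content-length']) != upload_size:
        return 499  # Client Disconnected
    if 'etag' in headers and headers['etag'].lower() != etag:
        return 422  # etag mismatch; the exact status is my assumption
    return 201


data = b'swift' * 50000
fd, tmppath = tempfile.mkstemp()
try:
    size, etag = write_upload(io.BytesIO(data), fd, bytes_per_sync=100000)
finally:
    os.close(fd)
    os.unlink(tmppath)
print(size, upload_status({'content-length': str(len(data))}, size, etag))
```

Hashing while writing means the body is only read once, which matters when the object is several gigabytes.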

Now we define the metadata that is going to be stored with the file:

metadata = {
    'X-Timestamp': request.headers['x-timestamp'],    # generated by the proxy server
    'Content-Type': request.headers['content-type'],  # defined by the user or guessed by the proxy server
    'ETag': etag,                                     # MD5 calculated while reading the request body
    'Content-Length': str(os.fstat(fd).st_size),      # fstat(2) on the file: what is really stored on disk
}

  • It adds to the metadata every header starting with ‘x-object-meta-’.
  • It adds to the metadata the headers that are allowed to be stored, as defined by the allowed_headers config option (default: allowed_headers = Content-Disposition, Content-Encoding, X-Delete-At, X-Object-Manifest).
  • It writes the file using the put method of the DiskFile class, which finalises the write on disk and renames the temp file to its real location:
  • It writes the metadata as extended attributes (xattr), which are stored directly with the file.
  • If there is a Content-Length in the metadata, it drops the kernel cache for that many bytes of the file.
  • It invalidates the hashes of the datadir directory using the function swift.obj.replicator.invalidate_hash.
  • This sets the hash of the dir to None, which tells the replication process that it has something to do for that dir (the hash will be regenerated then).
  • The hashes are stored per partition as a Python pickle, which in our case is /srv/node/sdb1/objects/2/hashes.pkl.
  • It moves the file from the tmp dir to the datadir.
  • It uses the unlinkold method of DiskFile to remove any older versions of the object file, i.e. any file with an older timestamp.
  • It then builds the request to send to the container server, passing the following:
  • account, container, obj as the request path.
  • the original headers.
  • the headers Content-Length, Content-Type, X-Timestamp, Etag, X-Trans-Id.
  • It gets the X-Container-{Host,Partition,Device} headers from the original request; they are set by the proxy so that the object server knows which container server to update. Each object server handling the PUT is assigned a different container server.
  • It uses the async_update method (on self, since it is part of the same class) to make an asynchronous request:
  • Passing the headers built above and req.path.
  • If the request succeeds (status 2xx), it returns to the main (PUT) method.
  • If the request does not succeed, it creates an async_pending file locally, which will be picked up later by the object updater (swift-object-updater) to retry the container update.
  • When finished, it responds with an HTTPCreated.
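
To tie the paths from this walkthrough together, here is a small sketch of the on-disk layout. It is my own illustration, not Swift code: object_hash and object_layout are hypothetical names, the hash suffix value is made up (the real one is a per-cluster secret), the "timestamp + .data" file name is an assumption, and the timestamp is only an example value.

```python
import hashlib
import posixpath

# Hypothetical value: the real suffix is a per-cluster secret.
HASH_PATH_SUFFIX = 'changeme'


def object_hash(account, container, obj, suffix=HASH_PATH_SUFFIX):
    """MD5 of '/account/container/obj' plus the cluster suffix (assumed form)."""
    name = '/%s/%s/%s' % (account, container, obj)
    return hashlib.md5((name + suffix).encode('utf-8')).hexdigest()


def object_layout(devices, device, partition, name_hash, timestamp):
    """Return the paths involved in storing one object."""
    part_dir = posixpath.join(devices, device, 'objects', str(partition))
    data_dir = posixpath.join(part_dir, name_hash[-3:], name_hash)
    return {
        'tmp_dir': posixpath.join(devices, device, 'tmp'),
        'data_dir': data_dir,
        # Assumed file name: the zero-padded timestamp plus '.data'.
        'data_file': posixpath.join(
            data_dir, '%016.05f.data' % float(timestamp)),
        'hashes_pkl': posixpath.join(part_dir, 'hashes.pkl'),
    }


layout = object_layout('/srv/node', 'sdb1', 2,
                       '46acec4563797178df9ec79b28146fe1', '1327991417.01411')
for key in sorted(layout):
    print(key, layout[key])
```

The suffix directory (the last three characters of the hash) is the unit whose entry in hashes.pkl gets reset, so the replicator only has to rehash the directories that changed.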

Audit a swift cluster

Swift integrity tools.

Quite a few tools ship with Swift to make sure the objects on your cluster are the right ones.

First, there is the basic one:


swift-object-info takes a Swift object stored on the filesystem and prints some information about it, like this:

swift@storage01:0/016/0b221bab535ac1b8f0d91e394f225016$ swift-object-info
Path: /AUTH_root/foobar/file.txt
Account: AUTH_root
Container: foobar
Object: file.txt
Object hash: 0b221bab535ac1b8f0d91e394f225016
Ring locations: - /srv/node/sdb1/objects/0/016/0b221bab535ac1b8f0d91e394f225016/
Content-Type: text/plain
Timestamp: 2012-01-31 06:30:17.014110 (1327991417.01411)
ETag: 053a0f8516a5023b9af76c49ca917d3e (valid)
Content-Length: 24 (valid)
User Metadata: {'X-Object-Meta-Mtime': '1327968327.21'}

PS: If you don’t know on which node your object is stored, you can use swift-get-nodes.

For auditing, the ETag value is important because swift-object-info compares the etag recorded in the object’s metadata with the MD5 of what is actually on disk. Let’s see if that works:

swift@storage01:0/016/0b221bab535ac1b8f0d91e394f225016$ cp /tmp
swift@storage01:0/016/0b221bab535ac1b8f0d91e394f225016$ echo "foo" >>
swift@storage01:0/016/0b221bab535ac1b8f0d91e394f225016$ swift-object-info|grep '^Etag'
Etag: 053a0f8516a5023b9af76c49ca917d3e doesn't match file hash of 9ff871e5ce5dcb5d3f2680a80a88ff38!

swift-object-info has detected that this file is not the one we have uploaded.

There is another tool called swift-drive-audit which, as explained in the admin guide, parses /var/log/kern.log with predefined regexps to detect disk failures reported by the kernel. It is usually run periodically by cron, and its config file is /etc/swift/drive-audit.conf. If the script finds errors for a drive, it unmounts it and comments it out in /etc/fstab. Afterwards the replication process picks up the objects from the other replicas and places them on a handoff node while that drive is out.
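
The log-scanning idea can be sketched in a few lines. This is not the script's real code: the two patterns below are hypothetical examples I wrote for illustration (the shipped regexps are different), error_limit is an assumed parameter, and the actual unmounting and /etc/fstab editing are left out.

```python
import re
from collections import Counter

# Hypothetical patterns, for illustration only.
ERROR_PATTERNS = [
    re.compile(r'\berror\b.*\b(sd[a-z]+\d*)\b'),
    re.compile(r'\b(sd[a-z]+\d*)\b.*\berror\b'),
]


def count_drive_errors(log_lines):
    """Count kernel error lines per device name."""
    errors = Counter()
    for line in log_lines:
        for pattern in ERROR_PATTERNS:
            match = pattern.search(line)
            if match:
                errors[match.group(1)] += 1
                break  # count each line once
    return errors


def drives_to_unmount(log_lines, error_limit=1):
    """Return the devices that reached error_limit errors."""
    counts = count_drive_errors(log_lines)
    return sorted(dev for dev, n in counts.items() if n >= error_limit)


sample_log = [
    'Jan 31 06:30:17 storage01 kernel: end_request: I/O error, dev sdb, sector 1234',
    'Jan 31 06:30:18 storage01 kernel: sd 2:0:0:0: [sdc] Unhandled error code',
    'Jan 31 06:30:19 storage01 kernel: EXT4-fs (sda1): mounted filesystem',
]
print(drives_to_unmount(sample_log))
```

Counting per device before acting leaves room for a threshold, so a single transient error does not have to take a drive out of the cluster.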

Swift also provides auditor daemons for each server type (account, container and object):

  •  swift-account-auditor
  •  swift-container-auditor
  •  swift-object-auditor

swift-account-auditor opens every SQLite database of an account server and runs a SQL query against it to make sure the database is valid.
swift-container-auditor does the same for containers.
swift-object-auditor opens every object of an object server and makes sure that:

  • The metadata is correct.
  • The size is right.
  • The MD5 is right.
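
Those three object checks can be sketched for a single file as follows. This is only an illustration under my own assumptions: audit_object is a made-up name, the set of required keys is a guess at what "correct metadata" means, and the metadata is passed in as a dict rather than read from the file's extended attributes.

```python
import hashlib
import os
import tempfile

# Assumed set of keys that every stored object should carry.
REQUIRED_KEYS = ('X-Timestamp', 'Content-Type', 'ETag', 'Content-Length')


def audit_object(data_path, metadata):
    """Return a list of problems for one object file; empty means healthy."""
    missing = [key for key in REQUIRED_KEYS if key not in metadata]
    if missing:
        return ['missing metadata: ' + ', '.join(missing)]
    problems = []
    size = os.path.getsize(data_path)
    if size != int(metadata['Content-Length']):
        problems.append('size %d != Content-Length %s'
                        % (size, metadata['Content-Length']))
    md5 = hashlib.md5()
    with open(data_path, 'rb') as fp:
        for chunk in iter(lambda: fp.read(65536), b''):
            md5.update(chunk)
    if md5.hexdigest() != metadata['ETag']:
        problems.append('md5 %s != ETag %s'
                        % (md5.hexdigest(), metadata['ETag']))
    return problems


body = b'hello swift\n'
meta = {
    'X-Timestamp': '1327991417.01411',
    'Content-Type': 'text/plain',
    'ETag': hashlib.md5(body).hexdigest(),
    'Content-Length': str(len(body)),
}
with tempfile.NamedTemporaryFile(delete=False) as fp:
    fp.write(body)
try:
    clean = audit_object(fp.name, meta)
    with open(fp.name, 'ab') as extra:
        extra.write(b'foo\n')  # tamper with the file after the upload
    tampered = audit_object(fp.name, meta)
finally:
    os.unlink(fp.name)
print(clean, len(tampered))
```

Appending to the file trips both the size and the MD5 check, which mirrors what happened with swift-object-info earlier.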

These auditors need to be configured in the matching <type>-server.conf. For the account server, for example, you would add something like this to /etc/swift/account-server.conf:

[account-auditor]
# You can override the default log routing for this app here (don't use set!):
# log_name = account-auditor
# log_facility = LOG_LOCAL0
# log_level = INFO
# Will audit, at most, 1 account per device per interval
interval = 1800

For the container server the options are about the same, but for the object server these are the options:

[object-auditor]
# You can override the default log routing for this app here (don't use set!):
# log_name = object-auditor
# log_facility = LOG_LOCAL0
# log_level = INFO
# files_per_second = 20
# bytes_per_second = 10000000
# log_time = 3600
# zero_byte_files_per_second = 50
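
To show how the commented-out values behave as defaults, here is a small sketch using Python's configparser. It is not how Swift loads its configuration: auditor_options is a hypothetical helper, and I am assuming the options live in an [object-auditor] section and are all integers.

```python
import configparser

# Defaults copied from the commented-out sample values above.
DEFAULTS = {
    'files_per_second': 20,
    'bytes_per_second': 10000000,
    'log_time': 3600,
    'zero_byte_files_per_second': 50,
}


def auditor_options(conf_text, section='object-auditor'):
    """Return the auditor options, falling back to DEFAULTS when unset."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    options = dict(DEFAULTS)
    if parser.has_section(section):
        for key in DEFAULTS:
            if parser.has_option(section, key):
                options[key] = parser.getint(section, key)
    return options


conf = """
[object-auditor]
# files_per_second = 20
bytes_per_second = 5000000
"""
options = auditor_options(conf)
print(options['files_per_second'], options['bytes_per_second'])
```

A line that stays commented out keeps its default, so only the values you want to change need to be uncommented.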

Another tool shipped with Swift is swift-account-audit, which audits a full account and reports any missing replicas or incorrect objects in that account.