Amazon S3, future “lingua franca” of object storage?


Block protocols, as their name implies, give access to individual blocks of data with a granularity as fine as 512 bytes. File protocols access data at the file level and therefore typically lock the entire file during access, although protocols such as SMB and NFS do allow part of a file to be locked.

File and block protocols are each well suited to certain classes of data. Block access works particularly well for applications such as databases, while NFS or SMB (CIFS) work well with files organized in hierarchical structures.

But the cloud has brought other access needs. One of the key protocols to emerge with it is Amazon S3, the object protocol associated with the AWS S3 service. In fact, S3 is fast becoming the de facto standard for object storage services. For example, a vendor such as Scality has chosen to integrate it natively into its platform (and even offers a free, open-source version of its S3 server). The same goes for Cloudian.

Object storage

In Amazon S3, an object is a piece of data with no specific format. It can be an image, a log, a file or any other type of unstructured data. Each object is associated with a set of metadata that describes it. In the case of Amazon S3, each object can carry up to 2 KB of metadata.
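As a rough illustration of this object model, the following Python sketch pairs an opaque payload with descriptive metadata and enforces the 2 KB limit. The class `S3Object`, the helper `metadata_size` and the counting rule (UTF-8 bytes of every key and value) are assumptions made for this example, not Amazon's code.

```python
from dataclasses import dataclass, field

MAX_METADATA_BYTES = 2 * 1024  # the 2 KB limit mentioned above


def metadata_size(metadata: dict[str, str]) -> int:
    """Sum of the UTF-8 byte lengths of all keys and values (assumed rule)."""
    return sum(
        len(k.encode("utf-8")) + len(v.encode("utf-8"))
        for k, v in metadata.items()
    )


@dataclass
class S3Object:
    """Hypothetical model of an object: raw bytes plus descriptive metadata."""

    key: str
    data: bytes
    metadata: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        size = metadata_size(self.metadata)
        if size > MAX_METADATA_BYTES:
            raise ValueError(
                f"metadata is {size} bytes, limit is {MAX_METADATA_BYTES}"
            )


photo = S3Object(
    key="my_photos_2015/jan/Bogota.jpg",
    data=b"\xff\xd8\xff",  # any bytes: the payload is never interpreted
    metadata={"camera": "phone", "city": "Bogota"},
)
```

The payload is deliberately typed as plain bytes: nothing in the model depends on whether it holds an image, a log or anything else.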

As a storage service operated by Amazon AWS, Amazon S3 is arguably the largest public object store, while Microsoft and Google use their own storage protocols. S3 debuted in 2006 and today stores tens of billions of objects, ranging in size from a few kilobytes to several terabytes.

These objects are organized by their owners into collections called “buckets”, to which specific access rights can be attached. Apart from the bucket concept, the organization of data in S3 is “flat” and has nothing in common with hierarchical structures such as a file tree.

Simple commands

S3 is both the name of Amazon’s cloud storage service and the collection of protocols and APIs used to access the service. Amazon provides several access methods, including a RESTful API and a SOAP API over HTTPS (now deprecated).

The S3 service can also be accessed via a number of SDKs including those for Java, .NET, PHP and Ruby.

In object storage like S3, objects are stored and retrieved via simple commands. PUT is used to store data, GET to retrieve it, and DELETE to erase it. Updates are also done via PUT requests. Depending on how the bucket is configured, a PUT either overwrites the existing object or creates a new version (if versioning is enabled on the bucket). The S3 API also includes commands such as COPY (for example, to copy objects between buckets) and LIST (for example, to list buckets).
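The difference between overwriting and versioning can be sketched with a small in-memory stand-in. `ToyBucket` and its methods are hypothetical names chosen for this illustration. It does not talk to S3, and its delete simply drops every version, which is a simplification.

```python
class ToyBucket:
    """In-memory stand-in for a bucket (illustration only, not the S3 API)."""

    def __init__(self, versioning: bool = False) -> None:
        self.versioning = versioning
        # key -> list of versions, oldest first
        self._versions: dict[str, list[bytes]] = {}

    def put(self, key: str, data: bytes) -> int:
        """Store data and return how many versions the key now has."""
        if self.versioning:
            self._versions.setdefault(key, []).append(data)
        else:
            self._versions[key] = [data]  # overwrite the existing object
        return len(self._versions[key])

    def get(self, key: str) -> bytes:
        """Return the latest version; raises KeyError if the key is absent."""
        return self._versions[key][-1]

    def delete(self, key: str) -> None:
        """Drop the key and all its versions (simplified behaviour)."""
        self._versions.pop(key, None)

    def list_keys(self) -> list[str]:
        return sorted(self._versions)


plain = ToyBucket()
plain.put("logs/app.log", b"v1")
plain.put("logs/app.log", b"v2")      # update via PUT: v1 is replaced

versioned = ToyBucket(versioning=True)
versioned.put("logs/app.log", b"v1")
versioned.put("logs/app.log", b"v2")  # v1 is kept, v2 becomes the latest
```

In both cases a GET returns `b"v2"`; only the versioned bucket still holds the earlier content.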

In S3, objects are referenced by a unique identifier chosen by the user. This can be the name of a file or a random series of characters. For example, “videos/2014/birthday/video1.wmv” or “my_photos_2015/jan/Bogota.jpg” are acceptable identifiers, just like “AE28B3648527645C3B48F65A32A26”.

The ability to choose human-readable identifiers is a real differentiator compared with other object storage platforms, which return an arbitrary alphanumeric identifier each time a new object is stored. It makes S3 more flexible and easier to use.
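Identifiers that contain “/” can be browsed like folders even though the namespace is flat, simply by filtering on a prefix. The S3 list operation accepts prefix and delimiter parameters for this purpose. The function below is only a simplified local imitation of that behaviour, and `list_keys` is a name chosen for this sketch.

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Return (objects, common_prefixes) found directly under `prefix`.

    Keys are plain strings: "folders" exist only as shared prefixes.
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:  # a delimiter remains: group the key under a pseudo-folder
            common_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


keys = [
    "videos/2014/birthday/video1.wmv",
    "my_photos_2015/jan/Bogota.jpg",
    "AE28B3648527645C3B48F65A32A26",
]

top_level = list_keys(keys)
birthday = list_keys(keys, prefix="videos/2014/birthday/")
```

Here `top_level` contains one object and two pseudo-folders, while `birthday` contains only the video: the hierarchy is a view computed from the key names, not a stored structure.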

Storage classes in S3

There are three storage classes in Amazon S3, each with different access, availability, and cost characteristics. Each object stored in Amazon S3 can be individually associated with one of these three classes:

Standard: Amazon S3’s general-purpose offering.

Standard (Infrequent Access): a reduced version of the Standard offering, for data that is accessed infrequently and does not need optimal availability.

Glacier: designed for long-term data archiving, with reduced access guarantees (retrieving an object stored in Glacier takes at least 4 hours).

Each storage class has its own price. For example, in Europe, the default price for the Standard class is $0.03/GB per month, while the Standard Infrequent Access class is charged at $0.0125/GB. Storing data in Glacier costs $0.007/GB. In addition to these nominal costs, there are data transfer charges (especially for outgoing data) as well as charges that depend on the number of read requests.
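To compare the classes for a given volume, the arithmetic can be sketched as follows. The prices are the ones quoted above and will have changed since. The dictionary keys reuse the storage class names of the S3 API, and `storage_cost` is a hypothetical helper that ignores transfer and request charges.

```python
# Per-GB storage prices quoted above (USD, Europe); they change over time.
PRICE_PER_GB = {
    "STANDARD": 0.03,
    "STANDARD_IA": 0.0125,
    "GLACIER": 0.007,
}


def storage_cost(gigabytes: float, storage_class: str) -> float:
    """Storage-only cost; transfer and request charges are not included."""
    return round(gigabytes * PRICE_PER_GB[storage_class], 2)


for name in PRICE_PER_GB:
    print(f"{name:12s} {storage_cost(1000, name):8.2f} USD for 1,000 GB")
```

For 1,000 GB this gives $30 in Standard, $12.50 in Standard Infrequent Access and $7 in Glacier, before any transfer or request fees.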

S3 remains a black box for its users

Amazon AWS does not provide technical details on how S3 is built. The provider gives only a few indications that reveal a little about how it operates.

Amazon Web Services operates from 12 geographic regions (a number that is growing steadily). These regions are divided into Availability Zones (currently 33), each of which groups one or more datacenters. Availability Zones provide resilience: in the case of S3, data is distributed across several zones, with multiple copies in each zone.

In terms of availability, Amazon guarantees 99.99% availability of data for the Standard class (99.9% for Standard Infrequent Access) and a durability of 99.999999999%, the latter figure indicating the risk of data loss.
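These percentages are easier to grasp once converted into time and object counts. The sketch below does that conversion. Reading the durability figure as a per-object, per-year survival probability is an assumption of this example, and both function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def max_downtime_minutes(availability: float) -> float:
    """Yearly downtime permitted by an availability figure such as 0.9999."""
    return (1 - availability) * MINUTES_PER_YEAR


def expected_annual_losses(object_count: int, durability_nines: int = 11) -> float:
    """Average number of objects lost per year, assuming each object
    independently survives a year with probability 1 - 10**-nines."""
    return object_count * 10.0 ** -durability_nines


standard = max_downtime_minutes(0.9999)       # Standard class
infrequent = max_downtime_minutes(0.999)      # Standard Infrequent Access
losses = expected_annual_losses(10_000_000)   # ten million stored objects
```

Under these assumptions, 99.99% allows roughly 53 minutes of unavailability per year and 99.9% nearly 9 hours, while eleven nines of durability on ten million objects corresponds to an average of one lost object every 10,000 years.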

Using S3 in your applications

Amazon S3 provides essentially unlimited storage capacity without the need to deploy, administer or support the underlying infrastructure in-house. However, the service is not magic, and there are a few essential points to know before using it as the storage layer for your applications:

Eventually consistent: S3’s data consistency model is “eventually consistent” for updates and deletions. This means that when an object is modified, a subsequent read may still return its previous version, because it takes some time for the change to be replicated between the availability zones of a region. It is possible to guard against this risk, but the appropriate logic has to be coded into the application.
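One possible form of that protective logic is to remember a fingerprint of what was written and re-read until the returned content matches it. The sketch below runs against `LaggingStore`, a hypothetical in-memory store that deliberately serves the old value a few times. `read_until_fresh` is likewise a name invented for this example, and the retry count and delay are arbitrary.

```python
import hashlib
import time


class LaggingStore:
    """Hypothetical store that returns the old value for a few reads after
    an update, imitating replication delay."""

    def __init__(self, stale_reads: int = 2) -> None:
        self._stale_reads = stale_reads
        self._old: dict[str, bytes] = {}
        self._new: dict[str, bytes] = {}
        self._pending: dict[str, int] = {}

    def put(self, key: str, data: bytes) -> str:
        if key in self._new:  # an update: keep the old value around for a while
            self._old[key] = self._new[key]
            self._pending[key] = self._stale_reads
        self._new[key] = data
        return hashlib.sha256(data).hexdigest()  # fingerprint of what was written

    def get(self, key: str) -> bytes:
        if self._pending.get(key, 0) > 0:
            self._pending[key] -= 1
            return self._old[key]  # stale answer
        return self._new[key]


def read_until_fresh(store, key, expected_digest, attempts=5, delay=0.01):
    """Re-read until the content matches the fingerprint returned by put()."""
    for _ in range(attempts):
        data = store.get(key)
        if hashlib.sha256(data).hexdigest() == expected_digest:
            return data
        time.sleep(delay)
    raise TimeoutError(f"{key!r} still stale after {attempts} reads")


store = LaggingStore(stale_reads=2)
store.put("report.csv", b"v1")
digest = store.put("report.csv", b"v2")
fresh = read_until_fresh(store, "report.csv", digest)
```

The writer has to pass the fingerprint to the reader by some other channel, which is part of the application logic the service leaves to its users.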

Compliance: Amazon S3 stores data in a limited number of countries, which can cause regulatory problems. For example, AWS has no “France” region but two regions in Europe (West and Central), whose datacenters are located in Ireland and Frankfurt respectively, which can be a problem for some industries. This is why it is interesting that some on-premises object storage systems adopt the S3 API: it opens the way to hybrid applications that use public cloud storage or private object storage as needed, all through the same access protocol.
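Because the protocol is the same on both sides, a hybrid application mostly has to decide which endpoint to address. The sketch below picks one according to a data residency constraint. The two AWS URLs are the regional S3 endpoints for Ireland and Frankfurt. The on-premises URL, the `Endpoint` class and `pick_endpoint` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    url: str
    country: str  # ISO country code of the hosting location


ENDPOINTS = [
    Endpoint("aws-eu-west", "https://s3.eu-west-1.amazonaws.com", "IE"),
    Endpoint("aws-eu-central", "https://s3.eu-central-1.amazonaws.com", "DE"),
    # Hypothetical S3-compatible platform hosted in-house, in France.
    Endpoint("on-premises", "https://s3.example.internal", "FR"),
]


def pick_endpoint(required_country=None):
    """Return the first endpoint that satisfies the residency constraint."""
    for endpoint in ENDPOINTS:
        if required_country is None or endpoint.country == required_country:
            return endpoint
    raise LookupError(f"no endpoint hosted in {required_country!r}")


regulated = pick_endpoint("FR")   # data that must stay in France
unrestricted = pick_endpoint()    # no constraint: first endpoint in the list
```

The rest of the application code can stay identical whichever endpoint is selected, which is the practical benefit of a shared access protocol.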

Security: S3 secures data access through AWS access key mechanisms. It is possible to access S3 with the account’s own access keys, although this is not recommended. The preferred approach is to rely on AWS Identity and Access Management (IAM), which lets you create an account for each employee who needs access to the data and assign it the appropriate permissions (it is also possible to rely on identity federation to generate temporary credentials).
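Permissions of this kind are expressed as JSON policy documents attached to a user or group. As an illustration, the sketch below builds a read-only policy for a single bucket. The bucket name is a placeholder and `read_only_policy` is a helper invented for this example. The code only produces the document and does not call AWS.

```python
import json


def read_only_policy(bucket: str) -> dict:
    """Build a policy document allowing listing and reading of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",  # the bucket itself
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",  # the objects inside it
            },
        ],
    }


policy = read_only_policy("example-bucket")  # placeholder bucket name
print(json.dumps(policy, indent=2))
```

Listing applies to the bucket and reading applies to the objects inside it, which is why the two statements target different resources.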

When it comes to S3 compatibility, support for these IAM mechanisms is a key point for vendors of third-party object storage platforms such as Scality or Cloudian.

Beyond access control, S3 also offers data protection features. Data can be encrypted with user-supplied encryption keys (which requires users to manage their keys appropriately).
Finally, all actions performed through the S3 API can be logged using another AWS product, CloudTrail. This tracing capability is critical for businesses that need to audit their infrastructure for compliance or security purposes. Note that CloudTrail is a paid service and that its logs are stored in an S3 bucket, the cost of which is added to the bill.

Locking: S3 provides no function to serialize access to data. It is up to the application using the service to ensure that multiple simultaneous PUT requests do not conflict. This means developing the business logic that controls read and write processes in environments where objects are frequently updated.
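Within a single process, this serialization can be as simple as one lock per object key around each read-modify-write cycle. The sketch below shows the idea with a plain dictionary standing in for the bucket. `SerializedWriter` is a hypothetical name, and writers spread over several machines would need a shared lock service instead.

```python
import threading
from collections import defaultdict


class SerializedWriter:
    """Serializes read-modify-write cycles per key inside ONE process.

    `store` is a dict used as a stand-in for the object store.
    """

    def __init__(self, store: dict) -> None:
        self._store = store
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself

    def update(self, key, modify, default=b""):
        with self._guard:
            lock = self._locks[key]
        with lock:  # only one writer per key at a time
            current = self._store.get(key, default)  # GET
            self._store[key] = modify(current)       # PUT


def bump(raw: bytes) -> bytes:
    """Example modification: increment a counter stored as ASCII digits."""
    return str(int(raw or b"0") + 1).encode()


store = {}
writer = SerializedWriter(store)
threads = [
    threading.Thread(target=writer.update, args=("counter", bump))
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the per-key lock in place, each of the 50 concurrent updates sees the result of the previous one, so none of them is lost.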

Cost: The cost of using Amazon S3 can grow significantly if data is accessed frequently. Every request is billed, as are the associated data transfers (at least when the data leaves AWS). And that is without counting the investment in bandwidth needed to ensure adequate access performance to the service.

This is also what can make deploying S3-compatible object storage in-house attractive. The benefits of the protocol and of hybrid cloud capabilities are retained, while a high-performance internal infrastructure serves the data that is accessed most frequently. Of course, this means accepting the task of managing the storage infrastructure components in-house.
