Before I discuss file and object integration options, I will first explain why such integration is important. I see this integration as part of the evolving storage tiering model covered in this prior blog. For those who haven’t read that blog, I can summarize the subject as follows: storage tiers are different types of storage with different attributes and capabilities. For example, it is common to define mission-critical primary (i.e. application-facing) storage systems as “Tier 1” systems and backup/near-line systems as “Tier 4”. In the legacy world, 4 to 6 tiers were used, but the introduction of flash-based systems and object-based systems can simplify the tiering model and reduce the number of tiers to 2, as follows:
The reasoning behind a 2-tier approach
The 2-tier approach is based on the assertion that flash-based systems can provide optimal storage services for applications and infrastructure systems (e.g. cloud IaaS) and are very cost effective for performance-oriented (IOPS-heavy) workloads.
On the other hand, object-based secondary systems are highly scalable and provide cost efficiency for capacity- and bandwidth-oriented workloads.
In other words, flash-based systems are very efficient for IOPS-oriented profiles and object-based systems are very efficient for bandwidth-oriented profiles and cold storage.
In addition, since many object-based solutions use standardized interface variants for their access protocol, object integration helps customers avoid lock-in and ensure cost efficiency (due to the ability to choose from multiple, competing solutions).
In the legacy storage world, direct integration with object storage was rare, simply because object-based systems are relatively new additions to the enterprise storage stack. Therefore, the legacy approach to object integration is mainly an indirect one, as shown below. Applications are connected to primary storage solutions (file and/or block) and, typically, a backup server is used to move data from primary storage to a secondary storage backup/archival solution. The secondary storage may use proprietary and/or standard (non-object) backup protocols and products, such as NDMP for filers, NetBackup, and similar solutions. Some backup/archival solutions have a connector to object storage solutions for cold storage or deep archival (see this list).
A typical, legacy object integration model
Modern integration options for primary and secondary storage
Before we dive into details, we need to differentiate between two major classes of integration options. In the first class, the secondary storage is used to extend the primary system. This means that standard I/O flows (typically handled by the primary system) can be fulfilled partially or completely by issuing I/Os to the secondary system. In a way, the secondary storage system is hidden behind the primary, such that the primary capacity is extended using the secondary capacity. Two typical mechanisms – caching and ILM (Information Lifecycle Management) – are described later in this blog.
The second class of solutions is a family of solutions implementing copy services (backup, archiving, snapshot shipping, namespace sync, etc.) between the primary system and the secondary, object-based system. In this class of solutions, the integration is loosely coupled, such that each system is independent of the other. In other words, the primary and secondary solutions are not extending one another…instead, they interact to manage the copy services (i.e. to copy/sync the dataset between the different solutions and solution types). Two types of copy services integration – single file/object-based and dataset-based are described later in this blog.
By themselves, caching and dynamic tiering are not new concepts. However, the adaptation to use object stores on the backend is modern. When integrated in this way, the primary storage manages staging/destaging (or tiering) such that frequently accessed blocks/chunks/files are stored in the primary solution. Meanwhile, less frequently accessed (and/or modified) blocks are pushed to the secondary object store as specially formatted objects (see diagram below). As with any cache-based solution, if the cache algorithm is efficient (i.e. it achieves high “hit rates”) it can be very effective at maximizing performance and capacity. Of course, it also suffers from the classical disadvantages of cache-based solutions (see the comparison and discussion below). Note that, in most cases, a cache/tiering model is implemented in the lower storage layers and therefore can be implemented for all sorts of storage, including blocks and files.
A typical cache/tiering solution (simplified)
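To make the staging/destaging flow concrete, here is a minimal, in-memory sketch of the cache/tiering idea described above. The `TieredBlockStore` class, its LRU policy, and the `block/<id>` key format are all illustrative assumptions, not any vendor's actual implementation; a plain dict stands in for the secondary object store.

```python
from collections import OrderedDict

class TieredBlockStore:
    """Toy cache/tiering layer: hot blocks live in a small primary tier;
    cold blocks are destaged to a secondary object store as objects.
    (Illustrative sketch only - names and formats are assumptions.)"""

    def __init__(self, primary_capacity, object_store):
        self.capacity = primary_capacity
        self.primary = OrderedDict()      # block_id -> bytes, in LRU order
        self.object_store = object_store  # dict-like: object key -> bytes

    def _object_key(self, block_id):
        # Blocks are encoded as "specially formatted" objects in the store.
        return f"block/{block_id:016x}"

    def write(self, block_id, data):
        self.primary[block_id] = data
        self.primary.move_to_end(block_id)   # mark as most recently used
        self._destage_if_needed()

    def read(self, block_id):
        if block_id in self.primary:         # cache hit
            self.primary.move_to_end(block_id)
            return self.primary[block_id]
        # Cache miss: stage the block back in from the object store.
        data = self.object_store[self._object_key(block_id)]
        self.write(block_id, data)
        return data

    def _destage_if_needed(self):
        # Evict least recently used blocks once primary capacity is exceeded.
        while len(self.primary) > self.capacity:
            cold_id, cold_data = self.primary.popitem(last=False)
            self.object_store[self._object_key(cold_id)] = cold_data
```

Note how the cache's behavior is driven purely by capacity pressure and access recency: a shift in the active dataset immediately changes what gets staged and destaged, which is exactly the source of the service inconsistency discussed below.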
Another very effective integration model is based on a well-known concept called ILM (also known as Hierarchical Storage Management (HSM)). As in the case of caching, the ILM concept is not new; here we are discussing the modern solutions that have adapted the ILM model to object stores.
In this model (depicted below), a cooling/destaging mechanism scans the storage objects’ namespace and moves them to the colder tier (object store) according to predefined policies.
In most cases, a stub is maintained in the primary storage to automate retrieval of the objects from the secondary solution and/or to provide a pass-through mechanism (i.e. to forward the I/O without retrieval).
It may seem that ILM and caches are similar, but, in fact, they are very different: ILM is implemented at the file namespace level, while caches are typically implemented at the low-level block storage level. This means, for example, that ILM cannot be implemented for block systems (while object-level ILM exists and is implemented by all major cloud vendors). The main difference is that the cooling criteria of an ILM system are not based on real-time cache capacity, but rather on predefined attributes of the files. In other words, a cache is designed to keep X amount of storage in the upper tier and dynamically stage and destage entities (blocks, chunks, etc.) from the backend tier. ILM, instead, is an auto-archiving system that pushes colder files to the secondary solution (and back) while the main dataset is hosted in the primary storage. In most cases, caches can be configured to use less primary storage space than ILM…but caches suffer from I/O service inconsistency if the active dataset changes (which is very common). Therefore, caches are harder to tune and make it more difficult to ensure storage service quality.
A typical ILM (a.k.a. HSM) model (simplified)
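The policy-driven cooling scan and the stub mechanism can be sketched as follows. This is a simplified illustration, not a production HSM: the `ILM-STUB:` marker format, the age-based policy, and the use of the relative path as the object key are all assumptions made for this example.

```python
import os
import time

STUB_MARKER = b"ILM-STUB:"  # hypothetical stub format, for this sketch only

def cool_namespace(root, object_store, max_age_seconds):
    """Scan a file namespace and destage files not accessed within the
    policy window, leaving a stub that records the object key."""
    now = time.time()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if now - os.stat(path).st_atime < max_age_seconds:
                continue                       # still "hot" per policy
            key = os.path.relpath(path, root)  # file path used as object key
            with open(path, "rb") as f:
                object_store[key] = f.read()   # push cold file to object tier
            with open(path, "wb") as f:        # replace the file with a stub
                f.write(STUB_MARKER + key.encode())

def read_file(root, path, object_store):
    """Transparent retrieval: if the file is a stub, fetch the data from the
    object store (a real system might also pass the I/O through directly,
    without recalling the file)."""
    with open(os.path.join(root, path), "rb") as f:
        data = f.read()
    if data.startswith(STUB_MARKER):
        return object_store[data[len(STUB_MARKER):].decode()]
    return data
```

Note that, unlike the cache sketch, nothing here depends on how full the primary tier is: the policy (file age) alone decides what moves, which is the key difference discussed above.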
The simplest method of file-to-object sync is based on copying files and namespaces to object stores and vice versa (as depicted below). Every major cloud provider offers such tools (e.g. s3cmd put/sync), and most commonly each file is encoded as a single object (excluding very large files that are split across several objects) with the full file path as the object key. Some storage/cloud vendors also provide automated mechanisms to schedule and perform such sync operations. This sync model is very common, as it can be used for simple backup and/or as a simple dataset shipping mechanism, since many cloud services can consume “objectized” files directly from the object store.
While very simple and effective, the main disadvantage of this approach is that it only provides single object (or file) consistency. There is no dataset consistency or dataset versioning (even though versioning for single objects exists). This means that, in terms of version control, this approach is similar to the old CVS single-file “check in”/“check out” semantics. This scheme also does not provide an efficient namespace encoding, as a directory listing involves relatively slow and inaccurate (i.e. eventually consistent) list/search operations. For example, these lists can miss new files and/or return files that have already been deleted. Furthermore, some file attributes are not preserved in the transitions (hard links, ACLs, etc.). The storage efficiency of this approach is also very limited, as deduplication, differential sync, and small file compaction are very hard to achieve.
A file-to-object sync model
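The essence of this model can be captured in a few lines. The sketch below mimics the `s3cmd put/sync` pattern described above against a dict standing in for the object store; the helper names and the slash-separated key scheme are assumptions for illustration.

```python
import os

def sync_to_object_store(root, object_store):
    """One-way sync: each file becomes a single object whose key is the
    file's path relative to the sync root (as with `s3cmd sync`)."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            key = os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, "rb") as f:
                object_store[key] = f.read()  # single-object "check in"

def sync_from_object_store(object_store, root):
    """Reverse direction: recreate the namespace from the object keys.
    Note that hard links, ACLs, and similar attributes do not round-trip -
    only the path and the data survive the transition."""
    for key, data in object_store.items():
        path = os.path.join(root, *key.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
```

Each file is copied independently, so if the namespace changes mid-scan, the resulting object set mixes points in time. This is exactly the single-file-consistency limitation discussed above.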
A much newer approach is the “check in”/“check out” of a complete dataset. In this model, the entire dataset snapshot is encoded (checked in) using some proprietary object format. It can also be atomically checked out. Just as the “single file to object sync” model can be compared to the CVS model, the file dataset “check in”/“check out” can be compared to the atomic dataset check-in model used by all modern version control systems (distributed or not). The reason all modern systems use this model is simple…it ensures the higher level of dataset consistency required by many (if not most) applications. In other words, most applications are not able to cope with datasets that contain data from different points in time. With many files and directories encoded at once, it is also easier to provide capacity reduction techniques such as deduplication, compression, small file encoding, etc. In addition, a searchable index of the dataset can be produced (a “catalog”). Also, several versions of the dataset can be stored efficiently (i.e. using differential dataset updates). All of the above make this model a much more effective solution for backup/archival/DR and cross-site data sync, when compared to the single file sync method. Of course, there are also some downsides, e.g. the proprietary dataset format means the encoded files cannot be consumed directly from the object store by cloud services, unlike “objectized” single files.
A dataset “check in”/”check out” model
As described above, even modern methods of integrating file and object storage have their challenges and tradeoffs. At Elastifile, we are developing solutions that combine the best aspects of these modern approaches to deliver uniquely seamless integration to our customers. Stay tuned for future blogs to learn more…