RESOURCE SHARING ON INTERNET CERTIFICATE CONTENTS : CHAPTER 1 ______________________________________________________
1.1 IntroductionIt is commonly observed that the continued exponential growth in the capacity of fundamental computing resources — processing power, communication bandwidth, and storage is working a revolution in the capabilities and practices of the research community. It has become increasingly evident that the most revolutionary applications of this superabundance use resource sharing to enable new possibilities for collaboration and mutual benefit. Over the past 30 years, two basic models of resource sharing with different design goals have emerged. The differences between these two approaches, which are distinguished as the Computer Center and the Internet models, tend to generate divergent opportunity spaces, and it therefore becomes important to explore the alternative choices they present as we plan for and develop an information infrastructure for the scientific community in the next decade.
Interoperability and scalability are necessary design goals for distributed systems based on resource sharing, but the two models we consider differ in the positions they take along the continuum between total control and complete openness. The difference affects the tradeoffs they tend to make in fulfilling their other design goals. The Computer Center model, which came to maturity with the NSF Supercomputer Centers of the 80s and 90s, was developed in order to allow scarce and extremely valuable resources to be shared by a select community in an environment where security and accountability are major concerns. The form of sharing it implements is necessarily highly controlled – authentication and access control are its characteristic design issues. In the last few years this approach has given rise to a resource sharing paradigm known as information technology “Grids.” Grids are designed to flexibly aggregate various types of highly distributed resources into unified platforms on which a wide range of “virtual organizations” can build. By contrast, the Internet paradigm, which was developed over the same 30 year period, seeks to share network bandwidth for the purpose of universal communication among an international community of indefinite size. It uses lightweight allocation of network links via packet routing in a public infrastructure to create a system that is designed to be open and easy to use, both in the sense of giving easy access to a basic set of network services and of allowing easy addition of privately provisioned resources to the public infrastructure. While admission and accounting policies are difficult to implement in this model, the power of the universality and generality of the resource sharing it implements is undeniable.
Though experience with the Internet suggests that the transformative power of information technology is at its highest when the ease and openness of resource sharing is at its greatest, the Computer Center model experiencing a rebirth in the Grid while the Internet paradigm has yet to be applied to any resource other than communication bandwidth. But the Internet model can be applied to other kinds of resources, and that, with the current Internet and the Web as a foundation, such an application can lead to similarly powerful results. The storage technology developed is called the Internet Backplane Protocol (IBP), designed to test this hypothesis and explore its implications. IBP is our primary tool in the study of logistical networking, a field motivated by viewing data transmission and storage within unified framework.
1.2 What Is IBP?The proliferation of applications that are performance limited by network speeds leads us to explore new ways to exploit data locality in distributed settings. Currently, standard networking protocols (such as TCP/IP) support end-to-end resource control. An application can specify the endpoints associated with a communication stream, and possibly the buffering strategy to use at each endpoint, but have little control over how the communicated data is managed while it is traversing the network. In particular, it is not possible for the application to influence where and when data may be stored (other than at the endpoints) so that it can be accessed efficiently.
To support domain and application-specific optimization of data movement, the Internet Backplane Protocol (IBP) has been developed for managing storage within the network fabric itself. IBP allows an application to implement inter-process communication in terms of intermediate data staging operations so that locality can be exploited and scarce buffer resources managed more effectively.
In the case of the Internet, a large distributed system providing a communication service to systems at its periphery, such considerations of robustness and scalability led its designers to choose a stateless model, which is the networking equivalent of the functional model of computation. Routers perform stateless data transfer operations and control is calculated by routing algorithms which require only that the routers behave correctly, not that they maintain shared state. Indeed, the end-to-end model of IP networking has served the Internet well, allowing it to scale far beyond its original design while maintaining stable levels of performance and reliability.
It is important to note that the designers of large-scale information systems often follow a shared memory model because the functional model puts undesirable limitations on performance and control. It is difficult to express the management of stored data in functional terms, and resource management is the key to high performance. Moreover, people think naturally in terms of processes, not just computations, and persistence is the key to defining a process. Increasingly the design of Internet-based information systems is moving toward shared-memory approaches that support the management of distributed data, rather than just access. A rapidly growing global industry is being built up around the caching and replication of content on the World Wide Web and massive scientific datasets are being distributed via networked systems of large storage resources.
But the management of shared state in such systems is in no way a part of the underlying network model. All such management is consequently pushed up to the application level and must be implemented using application-specific mechanisms that are rarely interoperable. The result has been a balkanization of state management capabilities that prevents applications that need to manage distributed state from benefiting from the kind of standardization, interoperability, and scalability that have made the Internet into such a powerful communication tool. The Internet Backplane Protocol (IBP) has been defined in order to provide a uniform interface to state management that is better integrated with the Internet.
Because IBP is a compromise between two conflicting design models, it does not fit comfortably into either of the usual categories of mechanisms for state management: IBP can be viewed either as a mechanism to manage either communication buffers or remote files and both are equally valid characterizations and useful in different situations. If, in order to allow our terminology to be as neutral as possible, we simply refer to the units of data that IBP manages as byte arrays, then these different views of IBP can be presented as follows:
IBP as buffer management: In communication between nodes on the Internet, which is built upon the basic operation of delivering packets from sender to receiver, each packet is buffered at each intermediate node. In other words, the communication makes use of storage in the network. Because the capacity of even large storage systems is tiny compared with the amount of data that flows through the Internet, allocation of communication buffers must be time limited. In current routers and switches, time-limited allocation is implemented by use of FIFO buffers, serviced under the constraints of fair queuing.
Against this background, IBP byte arrays can be viewed as application-managed communication buffers in the network. IBP allows the use of time-limited allocation and FIFO disciplines to constrain the use of storage. These buffers can improve communication by way of application-driven staging of data, or they may support better network utilization by allowing applications to perform their own coarse-grained routing of data.
IBP as file management:
Since high-end Internet applications often transfer gigabytes of data, the systems to manage storage resources for such applications are often on the scale of gigabytes to terabytes in size. Storage on this scale is usually managed using highly structured file systems or databases with complex naming, protection and robustness semantics. Normally such storage resources are treated as part of a host system and therefore as more or less private.
From this point of view IBP byte arrays can be viewed as files that live in the network. IBP allows an application to read and write data stored at remote sites, and direct the movement of data among storage sites and to receivers. In this way IBP creates a shareable network resource for storage in the same way that standard networks provide shareable bandwidth for file transfer.
This characterization of IBP as a mechanism for the management of state in the network supplies an operational understanding of the proposed approach, but does not provide a unified view that synthesizes storage and networking together. In order to arrive at this more general view, we say that routing of packets through a network is a series of spatial choices that allows for control of only one aspect of data movement. An incoming packet sent out on one of several alternative links, but any particular packet is held in communication buffers for as short time as possible.
IBP, on the other hand, allows for data to be stored at one location while en route from sender to receiver, adding the ability to control data movement temporally as well as spatially. This generalized notion of data movement is termed logistical networking, drawing an analogy with systems of warehouses and distribution channels commonly used in the logistics of transporting military and industrial material. Logistical networking can improve an application’s performance by allowing files to be staged near where they will be used, data to be collected near its source, or content to be replicated close to its users.
1.3 BackgroundThe Internet Backplane Protocol is a mechanism developed for the purpose of sharing storage resources across networks ranging from rack mounted clusters in a single machine room to global networks. To approximate the openness of the Internet paradigm for the case of storage, the design of IBP parallels key aspects of the design of IP, in particular IP datagram delivery. This service is based on packet delivery at the link level, but with more powerful and abstract features that allow it to scale globally. Its leading feature is the independence of IP datagrams from the attributes of the particular link layer, which is established as follows:
· Aggregation of link layer packets masks its limits on packet size;
· Fault detection with a single, simple failure model (faulty datagrams are dropped) masks the variety of different failure modes;
· Global addressing masks the difference between local area network addressing schemes and masks the local network’s reconfiguration.
This higher level of abstraction allows a uniform IP model to be applied to network resources globally, and it is crucial to creating the most important difference between link layer packet delivery and IP datagram service. Namely,
Any participant in a routed IP network can make use of any link layer connection in the network regardless of who owns it. Routers aggregate individual link layer connections to create a global communication service.
This IP-based aggregation of locally provisioned, link layer resources for the common purpose of universal connectivity constitutes the form of sharing that has made the Internet the foundation for a global information infrastructure. IBP is designed to enable the sharing of storage resources within a community in much the same manner. Just as IP is a more abstract service based on link-layer datagram delivery, IBP is a more abstract service based on blocks of data (on disk, tape, or other media) that are managed as “byte arrays.” The independence of IBP byte arrays from the attributes of the particular access layer (which is our term for storage service at the local level) is established as follows:
· Aggregation of access layer blocks masks the fixed block size;
· Fault detection with a very simple failure model (faulty byte arrays are discarded) masks the variety of different failure modes;
· Global addressing based on global IP addresses masks the difference between access layer addressing schemes.
This higher level of abstraction allows a uniform IBP model to be applied to storage resources globally, and this is essential to creating the most important difference between access layer block storage and IBP byte array service:
Any participant in an IBP network can make use of any access layer storage resource in the network regardless of who owns it. The use of IP networking to access IBP storage resources creates a global storage service.
Regardless of the strengths of the IP paradigm, however, its application here leads directly to two problems. First, in the case of storage, the chronic vulnerability of IP networks to Denial of Use (DoU) attacks is greatly amplified. The free sharing of communication within a routed IP network leaves every local network open to being overwhelmed by traffic from the wide area network, and consequently open to the unfortunate possibility of DoU from the network. While DoU attacks in the Internet can be detected and corrected, they cannot be effectively avoided. Yet this problem is not debilitating for two reasons: on the one hand, each datagram sent over a link uses only a tiny portion of the capacity of that link, so that DoU attacks require constant sending from multiple sources; on the other hand, monopolizing remote communication resources cannot profit the attacker in any way - except, of course, economic side-effects of attacking a competitor’s resource. Unfortunately neither of these factors hold true for access layer storage resources. Once a data block is written to a storage medium, it occupies that portion of the medium until it is de-allocated, so no constant sending is required. Moreover it is clear that monopolizing remote storage resources can be very profitable for an attacker.
The second problem with sharing storage network style is that the usual definition of a storage service is based on processor-attached storage, and so it includes strong semantics (near-perfect reliability and availability) that are difficult to implement in the wide area network. Even in “storage area” or local area networks, these strong semantics can be difficult to implement and are a common cause of error conditions. When extended to the wide area, it becomes impossible to support such strong guarantees for storage access.
Both issues are addressed through special characteristics of the way IBP allocates storage:
Allocations of storage in IBP can be time limited. When the lease on an allocation expires, the storage resource can be reused and all data structures associated with it can be deleted. An IBP allocation can be refused by a storage resource in response to over allocation, much as routers can drop packets, and such “admission decisions” can be based on both size and duration. Forcing time limits puts transience into storage allocation, giving it some of the fluidity of datagram delivery.
The semantics of IBP storage allocation are weaker than the typical storage service. Chosen to model storage accessed over the network, it is assumed that an IBP storage resource can be transiently unavailable. Since the user of remote storage resources is depending on so many uncontrolled remote variables, it may be necessary to assume that storage can be permanently lost. Thus, IBP is a “best effort” storage service. To encourage the sharing of idle resources, IBP even supports “volatile” storage allocation semantics, where allocated storage can be revoked at any time. In all cases, such weak semantics mean that the level of service must be characterized statistically.
Because of IBP’s limitations on the size and duration of allocation and its weak allocation semantics, IBP does not directly implement reliable storage abstractions such as conventional files. Instead, these must be built on top of IBP using techniques such as redundant storage, much as TCP builds on IP’s unreliable datagram delivery in order to provide reliable transport.
CHAPTER 2 IBP STRUCTURE AND CLIENT API________________________________________________
IBP has been designed to be a minimal abstraction of storage to serve the needs of logistical networking, i.e. to manage the path of data through both time and space. The fundamental operations are:
1. Allocating a byte array for storing data.
2. Moving data from a sender to a byte array.
3. Delivering data from a byte array to a receiver (either another byte array or a client).
A client API for IBP that consists of seven procedure calls, and server daemon software that makes local storage available for remote management has been defined and implemented. Currently, connections between clients and servers are made through TCP/IP sockets.
IBP client calls may be made by anyone who can attach to an IBP server (which we also call an IBP depot to emphasize its logistical functionality). IBP depots require only storage and networking resources, and running one does not necessarily require supervisory privileges. These servers implement policies that allow an initiating user some control over how IBP makes use of storage. An IBP server may be restricted to use only idle physical memory and disk resources, or to enforce a time-limit on all allocations, ensuring that the host machine is either not impacted at all, or only impacted for a finite duration. The goal of these policies is to encourage users to experiment with logistical networking without over-committing server resources.
Logically speaking, the IBP client sees a depot’s storage resources as a collection of append-only byte arrays. There are no directory structures or client-assigned file names. Clients initially gain access to byte arrays by allocating storage on an IBP server. If the allocation is successful, the server returns three capabilities to the client: one for reading, one for writing, and one for management. These capabilities can be viewed as names that are assigned by the server. Currently, each capability is a text string encoded with the IP identity of the IBP server, plus other information to be interpreted only by the server. This approach enables applications to pass IBP capabilities among themselves without registering these operations with IBP, and in this way supports high-performance without sacrificing the correctness of applications. The IBP client API consists of seven procedure calls, broken into the following three groups:
Allocation:
IBP_cap_set IBP_allocate(char *host, int size, IBP_attributes attr)
Reading / Writing:
IBP_store(IBP_cap write_cap, char *data, int size)
IBP_remote_store(IBP_cap write_cap, char *host, int port, int size)
IBP_read(IBP_cap read_cap, char *buf, int size, int offset)
IBP_deliver(IBP_cap read_cap, char *host, int port, int size, int offset)
IBP_copy(IBP_cap source, IBP_cap target, int size, int offset)
Management:
IBP_manage(IBP_cap manage_cap, int cmd, int capType, IBP_status info)
2.1 Allocation The heart of IBP’s innovative storage model is its approach to allocation. Storage resources that are part of the network, as logistical networking intends them to be, cannot
be allocated in the same way as they are on a host system. Any network is a collection of shared resources that are allocated among the members of a highly distributed community. A public network serves a community which is not closely affiliated and which may have no social or economic basis for cooperation. To understand how IBP needs to treat allocation of storage for the purposes of logistical networking, it is helpful to consider the problem of sharing resources in the Internet, and how that situation compares with the allocation of storage resources on host systems.
In the Internet, the basic shared resources are data transmission and routing. The greatest impediment to sharing these resources is the risk that their owners will be denied the use of them. The reason that the Internet can function in the face of the possibility of denial-of-use attacks is that it is not possible for the attacker to profit in proportion to their own effort, expense and risk. When other resources, such as disk space in spool directories, are shared, we tend to find administrative mechanisms that limit their use by restricting either the size of allocations or the amount of time for which data will be held.
By contrast, a user of a host storage system is usually an authenticated member of some community that has the right to allocate certain resources and to use them indefinitely. Consequently, sharing of resources allocated in this way cannot extend to an arbitrary community. For example, an anonymous FTP server with open write permissions is an invitation for someone to monopolize those resources; such servers must be allowed to delete stored material at will.
In order to make it possible to treat storage as a shared network resource, IBP supports some of these administrative limits on allocation, while at the same time seeking to provide guarantees that are as strong as possible for the client. So, for example, under IBP allocation can be restricted to a certain length of time, or specified in a way that permits the server to revoke the allocation at will. Clients who want to find the maximum resources available to them must choose the weakest form of allocation that their application can use.
To allocate storage at a remote IBP depot, the client calls IBP allocate(). The maximum storage requirements of the byte array are noted in the size parameter, and additional attributes (described below) are included in the attr parameter. If the allocation is successful, a trio of capabilities is returned.
There are several allocation attributes that the client can specify:
Permanent Vs Time-limited: The client can specify whether the storage is intended to live forever, or whether the server should delete it after a certain period of time.
Volatile Vs. stable: The client can specify whether the server may revoke the storage at any time (volatile) or whether the server must maintain the storage for the lifetime of the buffer.
Byte-array/ Pipe/ Circular-queue: The client can specify whether the storage is to be accessed as an append-only byte array, as a FIFO pipe (read one end, write to another), or as a circular queue where writes to one end push data off of the other end once a certain queue length has been attained. We anticipate that applications making use of shared network storage, (storage that they do not explicitly own), will be constrained to allocate either permanent and volatile storage, or time-limited and stable storage.
2.2 Reading / Writing All reading and writing to IBP byte arrays is done through the four reading/writing calls defined above.
These calls allow clients to read from and write to IBP buffers. IBP store() and IBP read() allow clients to read from and write to their own memory. IBP remote store() and IBP deliver() allow clients to direct an IBP server to connect to a third party (via a socket) for writing/reading. Finally, IBP copy() allows a client to copy an IBP buffer from one server to another. Note that IBP remote store(), IBP deliver() and IBP copy() all allow a client to direct an interaction between two other remote entities. The support that these three calls provide for third party transfers are an important part of what makes IBP different from, for example, typical distributed file systems. The semantics of IBP store(), IBP remote store(), and IBP copy() are append-only. IBP read(), IBP deliver() and IBP copy() allow portions of IBP buffers to be read by the client or third party. If an IBP server has removed a buffer, due to a time-limit expiration or volatility, these client calls simply fail, encoding the reason for failure in an IBP errno variable.
2.3 Management / Monitoring All management of IBP byte arrays is performed through the IBP manage() call. This procedure requires the client to pass the management capability returned from the initial IBP allocate() call. With IBP manage(), the client may manipulate a server-controlled reference count of the read and write capabilities. When the reference count of a byte array’s write capability reaches zero, the byte array becomes read-only. When the read capability reaches zero, the byte array is deleted. Note that the byte array may also be removed by the server, due to time-limited allocation or volatility. In this case, the server invalidates all of the byte array’s capabilities. The client may also use IBP manage() to probe the state of a byte array and its IBP server, to modify the time limit on time-limited allocations, and to modify the maximum size of a byte array.
CHAPTER 3 IBP DESIGN ISSUES________________________________________________
A design for shared network storage technology that draws on the strengths of the Internet model must also cope with its weaknesses and limitations. In this section we sketch the key design issues that we have encountered in developing IBP.
3.1 SecurityThe basis of IBP network security is the capability. Capabilities are created by the depot in response to an allocation request and returned to the client in the form of long, cryptographically secure byte strings. Every subsequent request for actions on the allocated byte array must then present the appropriate capability. As long as capabilities are transmitted securely between client and server, and the security of the depot itself is not compromised, only someone who has obtained the capability from the client can perform operations on the byte array. It is important to notice that this is the only level of security that IBP must deal with: the data encryption, because of its end-to-end nature, has to be handled in the layer(s) above IBP. For example, the IBP-mail application see encrypts all the data sent, but this is, and has to be, completely transparent to the IBP depot. The same applies for data compression and any other end-to-end service.
3.2 TimeoutsTimeouts are implemented in both the IBP library and server for different purposes. The client library deals with the possibility that faults in the network or server will cause a particular request to hang. On the server side, the fact that reading and writing to a byte array may occur simultaneously means that a particular IBP_load or IBP_store operation may not be able to fulfill its byte count immediately; to avoid such problems, a server timeout (or Server Sync) has been implemented to let the server wait a specified amount of time for the conditions requested to be available before answering a particular request.
3.3 Data MoverSince the primary intent of IBP is to provide a common abstraction of storage, it is arguable that third party transfer is unnecessary. Indeed, it is logically possible to build an external service for moving data between depots that access IBP allocations using only the IBP_load and IBP_store; however, such a service would have to act as a proxy for clients, and this immediately raises trust and performance concerns. The IBP_copy and IBP_mcopy data movement calls were provided in order to allow a simple implementation to avoid these concerns, even if software architectures based on external data movement operation are still of great interest to us. The intent of the basic IBP_copy call is to provide access to a simple data transfer over a TCP stream. This call is built in the IBP depot itself, to offer a simple solution for data movement; to achieve the transfer, the depot that receives the call is acting as a client doing an IBP_store on the target depot. IBP_mcopy is a more general facility, designed to provide access to operations that range from simple variants of basic TCP-based data transfer to highly complex protocols using multicast and real-time adaptation of the transmission protocol, depending on the nature of the underlying backbone and of traffic concerns. In all cases, the caller has the capacity to determine the appropriateness of the operation to the depot’s network environment, and to select what he believes to be the best data transfer strategy; a similar control is given over any error correction plan, should the data movement call return in failure.
3.4 Depot Allocation Policy and Client Allocation StrategiesOne of the key elements of the IBP design is the ability of depots to refuse allocations and the strategies used by clients in the face of such refusal. IBP depots implement an allocation policy based on both time and space. Any allocation has to be lower than the curve defined by “space*time = K”, where K varies linearly with the remaining free space. This permits smaller allocations to have a longer duration. This policy gives us some defense to DoU attacks, because any allocation of storage that depletes a large fraction of the remaining space in a particular depot will tend to be limited to a short duration, thereby limiting the impact on other depot users.
CHAPTER 4 THE EXNODE_______________________________________________
One of the key strategies of the IBP project was adhering to a very simple philosophy: IBP models access layer storage resources as closely as possible while still maintaining the ability to share those resources between applications. While this gives rise to a very general service, it results in semantics that might be too weak to be conveniently used directly by many applications. As in the networking field, the IP protocol alone does not guarantee many highly desirable characteristics and it needs to be complemented by a transport protocol such as TCP. What is needed is a service at the next layer that, working in synergy with IBP, implements stronger allocation properties such as reliability and arbitrary duration that IBP does not support.
4.1 IntroductionFollowing the example of the Unix inode, which aggregates disk blocks to implement a file, single generalized data structure is implemented, called an external node, or exNode, in order to manage aggregate allocations that can be used in implementing network storage with many different strong semantic properties. Rather than aggregating blocks on a single disk volume, the exNode aggregates storage allocations on the Internet, and the exposed nature of IBP makes IBP byte-arrays exceptionally well adapted to such aggregations (Figure 1). In the present context the key point about the design of the exNode is that it has allowed us to create an abstraction of a network file to layer over IBP-based storage in a way that is completely consistent with the exposed resource approach.
The exNode library has a set of calls (Table 2) that allow an application to create and destroy an exNode, to add or delete a mapping from it, to query it with a set of criteria, and to produce an XML serialization, to allow the use of XML-based tools and interoperability with XML systems and protocols.
Figure 4.1: Inode compared to exNode – The exNode implements a file turned inside
out.
exNode Management Segment Management
XndCreateExNodexndFreeExNode xndCreateSegmentxndFreeSegmentxndAppendSegmentxndDeleteSegment
Segment Query exNode Serialization
XndQueryxndEnumNextxndFreeEnum XndSerializexndDeserialize
4.2 Mobile Control State Of A Network FileThe exNode implements an abstract data structure that represents information known about the storage resources implementing a single file. The exNode is a set of declarations and assertions that together describe the state of the file. For purposes of illustration in this we introduce a small subset of the exNode specification and a minimal example of its application.
4.2.1 A Simple exNode APIIn this minimal formulation, the exNode is a set of mappings, each of which specifies that a particular portion of the file’s byte extent during a certain period of time is mapped to a storage resource specifier that is given by a string (a URL or IBP capability). A minimal exNode API must give us a means to create and destroy these sets of mappings, as well as a way of building them.
1. Creation and destruction are implemented by simple constructor and destructor functions.
xnd_t n = xnd_create()
xnd_destroy(xnd_t n)
2. An exNode is built by an operation that adds a mapping by specifying a data extent (start byte and length) a temporal extent (start time and duration) and a storage resource specifier.
xnd_addmap(xnd_t n, unsigned int data_start, unsigned int data_length,time_t
time_start,time_t time_length, char *storage)
3. The simplest possible useful query to an exNode simply finds one (of possibly many) storage resource descriptor and offset that holds the nth byte of the data extent at a specified time.
xnd_bytequery(xnd_t n, unsigned int byte_pos, time_t when, char **storage, unsigned int *offset);
This minimal exNode API can be extended in a number of ways that have been left out of this account for the sake of clarity, and to keep from having to introduce additional structure. Some of these extensions include:
1. Queries can be much more complex, specifying ranges of data and time, and returning sets of storage resources with associated metadata to direct the process of retrieving data.
2. Mappings can be annotated to specify read-only or write-only data.
3. As storage allocations expire or become unavailable it will be necessary to manage the exNode by finding and deleting mappings, and this will require additional mapping management calls.
4. By associating a mapping with a set of storage specifiers and an aggregation function, it possible to model group allocations such as RAID-like error correction.
5. By defining metrics on the location or performance or other characteristics of different storage allocations it is possible to inform the user of the exNode which of multiple alternatives to choose.
4.2.2 XML Serialization of the exNodeThe mobility of the exNode is based on two premises:
1. It is possible to populate the exNode exclusively with network-accessible storage resources.
2. The exNode can be encoded in a portable way that can be interpreted at any node in the network Today, XML is the standard tool used to implement portable encoding of structured data, and so we are defining a standard XML serialization of the exNode. The serialization is based on the abstract exNode data structure, and so allows each node or application to define its own local data structure.
4.2.3 Sample exNode Applications1. IBP-Mail is a simple application that uses IBP to store mail attachments rather than include them in the SMTP payload using MIME encoding. IBP-Mail builds an exNode to represent the attached file and then sends the XML serialization of that file in the SMTP payload. The receiver can then rebuild an exNode data structure and use it to access the stored attachment.
2. A simple distributed file system can be built by storing serialized exNodes in the host file system and using them like Unix soft links. Programs that would normally access a file instead find the exNode serialization, build an exNode data structure and use it to access the file. Caching can be implemented by creating a copy of accessed data on a nearby IBP depot and entering the additional mappings into the stored exNode.
3. A mobile agent that uses IBP depots to store part of its state can carry that state between hosts in the form of a serialized exNode. If the hosts understand the exNode serialization, then they can perform file system tasks for the agent while it is resident, returning the updated exNode to the agent when it migrates.
4.3 Active File Management Using The exNodeIn conventional file systems, many users consider the mapping of files to storage resources as static or changing only in response to end-user operations, but in fact this is far from true:
1. Even in a conventional disk-based file system, detection of impending failure of the physical medium can result in the movement of data and remapping of disk block addresses.
2. Defragmentation and backups can be another examples of autonomous movement of data by the files system not driven by end-user operations.
3. In a RAID storage system, partial or complete failure of a disk results in regeneration and remapping of data to restore fault tolerance.
4. In network-based systems, scanning for viruses can be a necessary autonomous action resulting in deletion of files or movement to a quarantine area.
The exNode is most closely analogous to the state of a process when it is used to implement an autonomous activity that is not under the direct control of a client, and may be completely hidden. The following activities are examples of “file-process” activity.
4.3.1 Active Probing to Maintain Fault ToleranceThe exNode can be used to express simple redundant storage of data or, with the appropriate extension, to express storage aggregation functions such as redundant storage of error correcting codes. However, as with RAID systems, when failures are detected, data must be copied to reestablish redundancy. In the case of the network, which can experience partitions and other extreme events that cause depots to become unreachable, active probing may be required in order to ensure that data losses are detected in a timely manner. This can be accomplished by the host setting a timer and actively probing every depot for reachability. Because transient network partitions are common, this data must be correlated over time to deduce the likelihood of data loss or long-term unavailability. The frequency of probing may be adjusted to account for network conditions or the urgency of constant availability required by the application.
4.3.2 DefragmentationIn any system that responds to congestion by locally restricting the size of an allowable allocation, fragmentation can result. In this regard IBP is no exception: when depots become full, they limit the size of an allowable allocation, and clients will begin fragmenting the files they represent using the exNode (see Figure 4.1). For the case of network storage, fragmentation results in a loss of reliability, requiring increased forward error correction in the form of error coding or full duplication of stored data. This can put undesirable loads on the host system and actually increases the global demand for storage – at some point allocations simply fail.
Congestion can be caused by the under-provisioning of the network with storage resources, but it can also be caused by network partitioning that makes remote storage resources unavailable to the host system. Thus, storage congestion can be transient, but when it is relieved, the fragmentation that has been caused by it can persist. What is required is the merging of fragmented allocations. While it would be possible to attempt the wholesale defragmentation of an exNode, this may place a burden on the host system, and if attempted at a time when congestion is not yet relieved may be fruitless. Instead,
the host system may choose to attempt defragmentation through more aggressive lease renewal, combined with attempted merging of adjacent fragments. Over time, this will lead to a natural defragmentation up to the point where depots resist larger allocations.
4.3.3 Asynchronous Transfer ManagementThe current IBP client API implements a simple set of synchronous calls for all storage allocation, management and data transfer operations. However, large data transfers can take a very long time, sometimes longer than the exNode will reside at the client system at which it was initiated. In this case, the data transfer call must itself generate a capability to represent the state of the transfer in process. In order to find out the result of the data transfer operation and service the exNode accordingly, the receiver of the exNode must obtain the data transfer capability. Thus, data transfer capabilities and metadata describing the transfers they represent must be part of the exNode in transit. When the exNode is transferred between hosts before reaching its final destination (as when it is held by a mail spool server) that intermediate host can interpret and service the exNode (for instance representing a mail attachment). A further compliation arises when the intent of the host is not to initiate a single data transfer, but a set of dependent transfers, as when a mail message is routed to a set of receivers, making copies in a tree-structured manner. In this case, the sequence of operations and the dependences between them must be encoded in the exNode, and the processing of the exNode can involve issuing new data transfers as their dependences are resolved.
CHAPTER 5 EXAMPLE: IBP MAIL________________________________________________
IBP-Mail is an improvement to the current state of the art in mailing large files over the Internet. It arises from the addition of writable storage to the pool of available Internet resources. With IBP-Mail, a sender registers a large file with an IBP-Mail agent, and stores it into a network storage depot. The sender then mails a pointer to the file to the recipient using the standard SMTP mailing protocol. When the receiver reads the mail message, he or she contacts the agent to discover the file’s whereabouts, and then downloads the file. In the meantime, the agent can route the file to a storage depot close to the receiver. IBP-Mail allows for an efficient transfer of the file from sender to receiver that makes use of the asynchronous nature of mail transactions, does not expend extra spooling resources of the sender or receiver, and utilizes network resources efficiently and safely.
5.1 IntroductionElectronic mail is the de facto mechanism for communication on the Internet. In an electronic mail transaction, a sender desires to transmit a message to one or more recipients. Typically, this is done using the SMTP mailing Protocol. Electronic mail works extremely well for most mailing usages. However, the model has limitations when the sender desires to transmit large data files to one or more receivers. Given the current network infrastructure, there are four basic ways that a sender can employ to transmit a large data file to one or more recipients:
1. Send the file as an encoded mail message:
One of the earliest email tools is Unix’s uuencode. Uuencode transforms any binary data file into an ASCII message that can be sent and received by virtually all mail clients. Uudecode then transforms the message back into a binary data file. Later in time, the MIME protocol was developed as a standard format for attaching non-textual files to mail messages and identifying their types so that mail clients can bundle multiple data files into an email message, send them to recipients, and then have the recipients unbundle them and launch file-specific applications to act on them. While the MIME standard has certainly made the task of sending and receiving files much easier than using
uuencode/uudecode, they share the same fundamental limitations when the files are very large. First, in order to work across all SMTP daemons, the files must be encoded into printable ASCII, which expands them roughly by a factor of 1.4. Second, spooler space at both the sender and the receiver must be expended for the file. Typically, spooler space is allocated by an institution’s system administrator and is shared by all users of the system. Often, there is a limit on the amount of data that may be stored in the spooler space, meaning that files over a certain size cannot be sent with MIME or uuencode encoding. Moreover, if a sender sends to multiple recipients, a separate copy of the mail message will be stored for each recipient, even if recipients share the same spooling space. Therefore, for large files, the uuencode/MIME mechanism for sending the file is often untenable.
2. Store the file locally and mail a pointer:
A typical example has the sender storing the file in a local HTTP or anonymous FTP server, and mailing the receiver a pointer to it. When the receiver reads the mail, he or she downloads the file. This solution solves the problem of expending spooler resources at the sender and receiver, but still has limitations. First, the sender may not have access to an HTTP or FTP server that contains sufficient storage. Second, a significant period of time may pass between the time that the sender sends the message and the time that the receiver downloads the file. When the file is served at a local HTTP or FTP server, this time is wasted, because the receiver will not start the download until he or she reads the message. Third, there is no automatic mechanism to inform the sender that the file has been downloaded, and may therefore be deleted from the server. And fourth, since HTTP and anonymous FTP downloads may be initiated by anyone on the Internet and both protocols have directory listing operations, the file may be discovered and read by unintended recipients.
3. Upload the file remotely and mail a pointer:
A typical example of this is when the receiver owns an anonymous FTP server with an incoming directory in which anonymous users may upload files. The sender then uploads the file and mails the receiver a pointer to it. The receiver may then download the file upon reading the mail, and delete it when the download is finished. This solution solves many of the problems with the previous two solutions, but has three limitations. First, the sender has to wait for the file to be stored remotely before he or she may mail the pointer. Second, the receiver, in allowing anonymous users to write to his or her FTP server, opens the server to invasion by malicious users. Finally, the receiver may not have access to such a server.
4. Save the file to tape and use surface mail:
While this solution seems truly primitive, it is sometimes the only way to transmit very large files from a sender to a receiver.
5.2 The IBP-Mail SolutionThe IBP-Mail solution to this problem relies on the existence of writable storage depots in the network. Given such depots, the sending of large data files can be performed with near optimal efficiency. With IBP-Mail, the storage depots are registered with IBP-Mail agents. The sender registers with one of these agents and stores the file at a nearby depot. The sender then mails a message with the file’s ID, as assigned by the agent. Meanwhile, the agent attempts to move the file to a depot close to the receiver. When the receiver reads the message, he or she contacts the agent to discover the whereabouts of the file, and downloads it from the appropriate depot. The file may then be deleted from the depot. The IBP-Mail solution exhibits none of the problems of the above solutions:
1. The file is not expanded due to mailer limitations.
2. No extra storage resources are required at the sender or receiver.
3. If enough time has passed between the message’s initiation and the receiver’s reading of the message, the file should be close to the receiver for quick downloading.
4. The sender does not have to wait for the file to reach the receiver before sending the message.
5. Storage resources are not opened to malicious users.
6. Multiple recipients may download shared copies of the file.
Additionally, the IBP-Mail model may be extended to allow other interesting functionalities such as compression, data mining, fragmentation of very large data files, and uses of remote compute cycles.
IBP is software that allows applications to manage and make use of remote storage resources. It is structured as server daemons that run at the storage sites, serving up dedicated disk, spare disk, and physical memory to IBP clients. IBP clients can run anywhere on the Internet and do not have to be authenticated with the IBP servers. Thus, when an IBP server is running, anyone may use it. There are several features of IBP’s design that make offering storage as a network resource feasible:
There are no user-defined names:
IBP clients allocate storage, and if the allocation is successful, then it returns three capabilities to the client — one each for reading, writing, and management. These capabilities are text strings, and may be viewed as server-defined names for the storage. The elimination of user-defined names facilitates scalability, since no global namespace needs to be maintained. Additionally, there is no way to query an IBP server for these capabilities, so unlike an anonymous FTP server, there is no way for a user to gain access to a file unless he or she knows its name a priori.
Storage may be constrained to be volatile or time limited:
An important issue when serving storage to Internet applications is being able to reclaim the storage. IBP servers may be configured so that the storage allocated to IBP clients is volatile, meaning it can go away at any time, or time-limited, meaning that it goes away after a specified time period.
Clients can direct remote data operations:
One of IBP’s primitive operations is the ability to copy a file from one IBP server to another. This is ideal for applications that need to route data from one place to another, and direct this routing from a third party.
Reference counts are maintained on the files:
One operation that IBP clients may perform with management capabilities is an explicit increment and decrement of reference counts for reading and writing. When the write reference count is zero, the file becomes read-only. When the read reference count becomes zero, the file is deleted.
5.3 The Structure Of IBP-MailIBP-Mail has been designed with three goals in mind. First, it must work, and be available to general Internet users with a minimum of effort. Second, it must solve the problems with mailing large data files as detailed in the Introduction. Third, it must facilitate testing.
IBP-Mail consists of three components, as depicted in Figure 1. These are:
A pool of IBP servers:
Only one server is necessary in the pool for IBP-Mail to work. Ideally the pool will consist of a collection of servers with wide geographic distribution.
IBP-RQ: A registration/query server:
This is a server that maintains the state of the IBP server pool. Servers register with the IBP-RQ server, and clients may query the IBP-RQ server for information about
servers. More information about the IBP-RQ server is given below.
IBP-NT: An agent for naming and transport:
The IBP-NT keeps track of where the data file is in the IBP server pool, and directs the movement of the file from server to server. More information about the IBP-NT is given below.
An IBP-Mail transaction takes the following nine steps, also outlined in Figure 1:
1. The sender contacts an IBP-NT agent. In the current implementation, there is only one such agent, but multiple agents are possible. The size of the file is communicated to the agent.
2. The IBP-NT agent queries the IBP-RQ server to list appropriate IBP servers from the server pool that can store the file.
3. The IBP-NT agent allocates storage for the file on an IBP server, and receives the capabilities for the storage.
4. The IBP-NT agent creates a name for the transaction and returns that name, plus the write capability of the file, to the sender.
5. The sender stores the file into the IBP server.
6. The sender sends mail to the receiver with the name of the transaction. At the same time, the sender informs the agent that the file has been written to the IBP server.
7. The receiver presents the name to the IBP-NT agent.
8. The IBP-NT agent returns the read and manage capabilities of the file to the receiver.
9. The receiver downloads the file from the IBP server, and may delete it if desired, by decrementing the file’s read reference count. Note that if a file is shared by multiple recipients, the agent may increment its reference count to equal the number of recipients, and then the file will be deleted only after each recipient has decremented the reference count (or when the time limit expires).
There are two steps in Figure 1 labeled with an asterisk. These may be performed by the IBP-NT agent after the sender stores the file, if the agent determines that it may be able to move the file close to the receiver. If enough time passes between steps 6 and 7 (due to the receiver not reading his or her email instantly), then this time may be used by the agent to move the file close to the receiver(s), thereby improving the time for downloading. If a receiver tries to download the file before it has been copied, it may do so from the original IBP server, with reduced performance. The current implementation of each individual component is described below.
5.3.1 The IBP-RQ Server The IBP-RQ server provides basic directory registration and query services to the IBP-NT agent. The directory part of the server is a two-level structure, with groups containing hosts (IBP servers). A host can belong to more than one group at the same time.
While the directory structure can be implemented using existing technologies (e.g. LDAP), what sets the IBPRQ server apart is the query component. Queries in IBPRQ provide a ranking of IBP servers according to selection criteria embedded into IBP-RQ. Currently, two rankings are implemented: free storage available, and proximity to a given host. The former is easy to implement given IBP semantics. The latter is more difficult. Currently, each IBP server runs a SonarDNS daemon, and the IBPRQ can force all its servers to perform Sonar performance queries to the given host. While this implementation is not a long-term solution (for example, it is not scalable to a large number of IBP servers), it suffices for our current implementation. Similarly, IBP server failures are currently not detected, and the IBP-RQ is not a distributed service, problems which will have to be solved as IBP-Mail scales.
5.3.2 The IBP-NT AgentThe IBP-NT agent provides naming and transport to IBP-Mail clients. Currently there is only one, but there could potentially be many IBP-NT agents working independently. A client starts a mail transaction by contacting an IBP-NT agent and providing initial information about the transaction: sender location and file size. The agent then queries the IBP-RQ server to find an IBP server in which the sender should insert the file. Currently, this query attempts to find a server with enough storage that is closest to the sender (using Sonar DNS’s metrics for closeness). In the future, this may be improved to take account of server load or reliability as well. The agent allocates the storage, and then stores the capabilities for that storage into a second IBP file located at a server on or near the agent’s host. We will call this the name file. The read capability of name file is returned to the sender, along with the write capability of the first IBP file. The sender stores the data into the first IBP file, and then sends mail to the receiver with the capability name file. This capability serves as a name for the mail transaction.
At the same time, the sender informs the agent that the IBP file is has been stored, and gives the location(s) of the receiver(s). Meanwhile, the agent is free to transport the data file to a server near the receiver (or receivers). This may be done with a simple third-party IBP copy() call. When finished, the agent updates the information in the IBP name file and deletes the initial data file. When the receiver presents the agent with the name, the agent returns the read and manage capabilities of the data file, and the receiver downloads it and decrements the read capability if desired.
The decision to use IBP for the name file was for flexibility. With this structure, it is possible to have multiple agents manage the transport, to have agents restart upon failure without losing information, and even to have the receiver find the location of the file without contacting the agent. The allocation of the data files and the name files are performed using IBP’s time-limited allocation policy. Currently, the time limit default is 10 days, and this may be adjusted by the sender. As discussed above, the time-limited allocation is necessary for storage owners to be able to reclaim their storage resources from the network. It also solves the lost pointer problem in the case of agent failure or lost email. Additionally, it frees the sender and receiver from having to explicitly delete the file. One ramification of this is that there is a new failure mode for mail – time limit expiration on the data file. There are several ways to address this mode – send warning mail or error reporting mail back to the sender, send warning mail to the receiver, allow the sender to extend the time limit, or simply delete the file.
5.3.3 The Current Mail InterfaceThere are currently two sender interfaces to IBP-Mail. The first is a command-line Unix program that works like mpack1 for sending MIME mail attachments. It may be obtained on the web at http://www.cs.utk.edu/˜elwasif/ ibp-mail. The second interface is web-based. Senders may use this interface by pointing their web browsers to the URL http://www.cs.utk.edu/˜elwasif/ cgi-bin/ibp-mail.cgi. This is a CGI enabled web page that allows the sender to specify a file name, plus a message, and have the message/file sent to a set of recipients using IBP-Mail. File uploading in HTTP requires the file to go to the server of the web page; therefore the web interface requires that all attachments be uploaded to Tennessee’s web server and the web server then acts as a proxy sender of the message. This makes this interface less efficient than the command-line version, since all files must go through the server at Tennessee. We are working to have the web server redirect the sender to a closer web server in a manner analogous to the receiving protocol described below. We will also write a Windows 95/98/NT sending program in the near future.
The message that the receiver gets has a MIME-encoded URL that points to the IBP-NT agent and contains the capabilities of the name file. The IBP-NT agent must be running on a host that contains a web server, so that it can resolve this URL. When the receiver clicks on the URL, the IBP-NT agent is contacted through a CGI script on its web server, and it returns a web page with an HTTP redirection link. This link contains a URL for the IBP server that holds the data file. The receiver’s web browser, upon receiving the redirection, automatically attempts to get this new URL. We have written a small HTTP gateway daemon that can run on IBP hosts so that these redirection requests may be served directly by IBP. In this way, the receiver’s browser downloads the data file directly from IBP without any knowledge of IBP.
An alternative to this interface would be to provide a stand-alone application that retrieves IBP-Mail files, which could be launched as a special MIME type by MIME enabled mailers. Or we could modify a mailer such as Pine or MH to recognize IBP-Mail attachments and download the files accordingly. We chose the web-based alternative so that users can receive IBP-Mail files without needing to install any code; indeed, they do not even have to understand what IBP-Mail is.