Harvard Computer Science Technical Reports for 1997

TR-01-97 [tr-01-97.ps.gz (62 K)]
Yu Hu, S. Lennart Johnsson, Dimitris Kehagias, and Nadia Shalaby. 1997. ``DPF: A Data Parallel Fortran Benchmark Suite.''
We present the Data Parallel Fortran (DPF) benchmark suite, a set of data parallel Fortran codes for evaluating data parallel compilers, appropriate for any target parallel architecture with shared or distributed memory. The codes are provided in basic, optimized, and several library versions. The functionality of the benchmarks covers collective communication functions, scientific software library functions, and application kernels that reflect the computational structure and communication patterns in fluid dynamic simulations, fundamental physics, and molecular studies in chemistry and biology. The DPF benchmark suite assumes the language model of High Performance Fortran, and provides performance evaluation metrics of busy and elapsed times, FLOP rates and counts, memory usage, communication patterns, local memory access, and arithmetic efficiency, as well as operation and communication counts per iteration. An instance of the benchmark suite was fully implemented in CM--Fortran and tested on the CM--5.
TR-02-97 [tr-02-97.ps.gz (220 K)]
Nick Gloy, Zheng Wang, Catherine Zhang, Brad Chen, and Mike Smith. 1997. ``Profile-Based Optimization with Statistical Profiles.''
An important barrier to the use of profile-based optimization is the generation of high-quality profile data. In this paper we describe our experience with a prototype system for continuous and automatic generation of profile data. Our system collects profile data continuously for all computation on a system by sampling the program counter once per hardware clock interrupt. We have found that the overhead of our prototype monitor is negligible, making it practical to collect profile information continuously for all activity on the system. We used several qualitative and quantitative methods to explore how the quality of these sampled profiles improves over time, and demonstrate that profiles of adequate quality for optimization can be obtained after a small number of monitored runs. For a selection of instruction-cache intensive benchmarks, statistical monitoring provides profiles of a quality comparable to that of complete profiles obtained with program instrumentation while avoiding the large run-time overheads of instrumentation-based profile collection. Our monitor is a component of the Morph System, which will make feedback-based and architecture-specific optimizations automatic and transparent to the user.
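
The sampling mechanism can be pictured with a short user-level sketch (not the Morph monitor itself, which samples the hardware program counter at each clock interrupt): a profiling timer fires as CPU time is consumed and the handler records the code location that was executing. The workload function and the 10 ms sampling period below are arbitrary choices for illustration.

    # Minimal sketch of statistical profiling by periodic sampling (Unix only;
    # not the Morph monitor itself): a SIGPROF timer fires on consumed CPU time,
    # and the handler records the (file, line) currently executing. The counts
    # approximate where time is spent, analogous to sampling the program counter
    # once per clock interrupt.
    import collections
    import signal

    profile = collections.Counter()

    def sample(signum, frame):
        # Record the source location that was executing when the timer fired.
        profile[(frame.f_code.co_filename, frame.f_lineno)] += 1

    signal.signal(signal.SIGPROF, sample)
    signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)   # sample every 10 ms of CPU time

    def workload():
        total = 0
        for i in range(5_000_000):
            total += i * i
        return total

    workload()
    signal.setitimer(signal.ITIMER_PROF, 0)            # stop sampling

    for (filename, line), hits in profile.most_common(5):
        print(f"{filename}:{line}  {hits} samples")
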
TR-03-97 [tr-03-97.ps.gz (443 K)]
Aaron Brown. 1997. ``A Decompositional Approach to Performance Evaluation.''
Today's computer users are focusing more and more of their attention on applications heavy with multimedia and networking content, applications that stress the entire computer system from user software to operating system to hardware. Yet modern computer systems are still being designed and evaluated using standard scientific, engineering, or development workloads, as quantified by the SPEC benchmarks. Such benchmarks are not representative of modern applications, nor do they stress significant parts of the computer system. This thesis provides an alternative approach to computer system performance evaluation, one that considers the system as a whole, yet still allows detailed evaluation of the performance interactions between internal system components. We first introduce hbench-OS, a set of OS and hardware microbenchmarks designed to produce accurate and reproducible measurements. We then incorporate hbench-OS and other novel profiling techniques into a methodology for constructing a performance decomposition: a quantification of performance at each abstraction boundary in the system combined with a characterization of the interactions that take place across these boundaries. We demonstrate the effectiveness of this methodology by distilling its detailed characterization at the application boundary into a single metric that correctly predicts the performance of profiled applications (e.g., the Apache web server) across a spectrum of different hardware and operating system software configurations.
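
hbench-OS itself is a suite of C microbenchmarks; purely as an illustration of the repeated-trial measurement discipline such benchmarks depend on (many iterations per trial, several trials, a median to damp outliers), here is a minimal sketch. The timed operation, os.getppid, is just a stand-in for a cheap system call.

    # Illustrative microbenchmark harness (not hbench-OS): time a cheap operation
    # over many iterations, repeat for several trials, and report the median
    # per-operation latency to damp outliers and scheduling noise.
    import os
    import statistics
    import time

    def measure(op, iterations=100_000, trials=11):
        results = []
        for _ in range(trials):
            start = time.perf_counter()
            for _ in range(iterations):
                op()
            elapsed = time.perf_counter() - start
            results.append(elapsed / iterations)
        return statistics.median(results)

    if __name__ == "__main__":
        latency = measure(os.getppid)    # stand-in for a minimal system call
        print(f"median latency: {latency * 1e6:.3f} microseconds per call")
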
TR-04-97 [tr-04-97.ps.gz (313 K)]
Igor Najfeld and Timothy F. Havel. 1997. ``Embedding with a Rigid Substructure.''
This paper presents a new distance geometry algorithm for calculating atomic coordinates from estimates of the interatomic distances, which maintains the positions of the atoms in a known rigid substructure. Given a matrix of coordinates X for the rigid substructure, this problem consists of finding the matrix Y that yields the global minimum of the so-called STRAIN, i.e.
$$ \min_{Y} \left\| \begin{bmatrix} XX' & XY' \\ YX' & YY' \end{bmatrix} - \begin{bmatrix} A & B \\ B' & C \end{bmatrix} \right\|_F^2 $$
where A = XX', and B, C are matrices of inner products calculated from the estimated distances.

The vanishing of the gradient of the STRAIN is shown to be equivalent to a system of only six nonlinear equations in six unknowns for the inertial tensor associated with the solution Y. The entire solution space is characterized in terms of the geometry of the intersection curves between the unit sphere and certain variable ellipsoids. Upon deriving tight bilateral bounds on the moments of inertia of any possible solution, we construct a search procedure that reliably locates the global minimum. The effectiveness of this method is demonstrated on realistic simulated and chemical test problems.
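
For concreteness, here is a minimal NumPy sketch of evaluating the STRAIN objective defined above, assuming X holds the fixed substructure coordinates, Y the unknown coordinates, and A, B, C the inner-product blocks estimated from the distances; the random test data is illustrative only.

    # Sketch of evaluating the STRAIN objective from the abstract: the squared
    # Frobenius norm of the difference between the Gram matrix of [X; Y] and the
    # target blocks A, B, C estimated from the interatomic distances.
    import numpy as np

    def strain(X, Y, A, B, C):
        top = np.hstack([X @ X.T - A, X @ Y.T - B])
        bottom = np.hstack([Y @ X.T - B.T, Y @ Y.T - C])
        return np.linalg.norm(np.vstack([top, bottom]), "fro") ** 2

    # Toy example: a 4-atom rigid substructure and 2 free atoms in 3-D.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((4, 3))
    Y_true = rng.standard_normal((2, 3))
    A, B, C = X @ X.T, X @ Y_true.T, Y_true @ Y_true.T
    print(strain(X, Y_true, A, B, C))   # exactly consistent data -> 0.0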

TR-05-97 [tr-05-97.ps.gz (114 K)]
Lori Park. 1997. ``Development of a Systematic Approach to Bottleneck Identification in UNIX systems.''

TR-06-97 [tr-06-97.ps.gz (54 K)]
Christopher Small and Stephen Manley. 1997. ``A Revisitation of Kernel Synchronization Schemes.''
In an operating system kernel, critical sections of code must be protected from interruption. This is traditionally accomplished by masking the set of interrupts whose handlers interfere with the correct operation of the critical section. Because it can be expensive to communicate with an off-chip interrupt controller, more complex optimistic techniques for masking interrupts have been proposed.

In this paper we present measurements of the behavior of the NetBSD 1.2 kernel, and use the measurements to explore the space of kernel synchronization schemes. We show that (a) most critical sections are very short, (b) very few are ever interrupted, (c) using the traditional synchronization technique, the synchronization cost is often higher than the time spent in the body of the critical section, and (d) under heavy load NetBSD 1.2 can spend 9% to 12% of its time in synchronization primitives.

The simplest scheme we examined, disabling all interrupts while in a critical section or interrupt handler, can lead to loss of data under heavy load. A more complex optimistic scheme functions correctly under the heavy workloads we tested and has very low overhead (at most 0.3%). Based on our measurements, we present a new model that offers the simplicity of the traditional scheme with the performance of the optimistic schemes.

Given the relative CPU, memory, and device performance of today's hardware, the newer techniques we examined have a much lower synchronization cost than the traditional technique. Under heavy load, such as that incurred by a web server, a system using these newer techniques will have noticeably better performance.
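
The optimistic idea can be caricatured in a few lines: entry to a critical section sets an in-memory flag instead of masking interrupts at the controller, and an interrupt arriving while the flag is set is deferred until the section exits. The toy model below is not kernel code and ignores the atomicity details a real implementation must handle.

    # Toy model of optimistic critical-section protection: instead of masking
    # interrupts, entering a critical section just sets a flag. An "interrupt"
    # arriving while the flag is set is queued, and its handler runs when the
    # critical section exits. A real kernel must make these steps atomic.
    class OptimisticSync:
        def __init__(self):
            self.in_critical = False
            self.deferred = []

        def interrupt(self, handler):
            if self.in_critical:
                self.deferred.append(handler)   # defer instead of preempting
            else:
                handler()

        def critical_section(self, body):
            self.in_critical = True
            try:
                body()
            finally:
                self.in_critical = False
                pending, self.deferred = self.deferred, []
                for handler in pending:         # run work deferred during the section
                    handler()

    sync = OptimisticSync()
    sync.critical_section(lambda: sync.interrupt(lambda: print("handled after exit")))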

TR-07-97 [tr-07-97.ps.gz (156 K)]
A. Brown, I. Papaefstathiou, J. Simer, D. Sobel, J. Sutaria, S. Wang, T. Blackwell, M. Smith, and W. Yang. 1997. ``An IRAM-Based Architecture for a Single-Chip ATM Switch.''
We have developed an architecture for an IRAM-based ATM switch that is implemented with merged DRAM and logic for a cost of about $100. The switch is based on a shared-buffer memory organization and is fully non-blocking. It can support a total aggregate throughput of 1.2 gigabytes per second, organized in any combination of up to thirty-two 155 Mb/sec, eight 622 Mb/sec, or four 1.2 Gb/sec full-duplex links. The switch can be fabricated on a single chip, and includes an internal 4 MB memory buffer capable of storing over 85,000 cells. When combined with external support circuitry, the switch is competitive with commercial offerings in its feature set, and significantly less expensive than existing solutions. The switch is targeted to WAN infrastructure applications such as wide-area Internet access, data backbones, and digital telephony, where we feel untapped markets exist, but it is also usable for ATM-based LANs and could even be modified to penetrate the potentially lucrative Fast and Gigabit Ethernet markets.
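
As a rough illustration of the shared-buffer organization only (not the IRAM design itself), the toy model below lets every output port queue cells out of one common buffer pool; the port count and cell capacity are taken from the abstract, and everything else is made up.

    # Toy model of a shared-buffer switch: every output port queues cells out of
    # one common cell pool, so a busy port can use buffer space an idle port is
    # not using. The paper's switch buffers roughly 85,000 53-byte ATM cells in
    # 4 MB of on-chip DRAM; this sketch only models the bookkeeping.
    from collections import deque

    class SharedBufferSwitch:
        def __init__(self, ports, capacity_cells):
            self.capacity = capacity_cells
            self.used = 0
            self.queues = [deque() for _ in range(ports)]

        def enqueue(self, out_port, cell):
            if self.used >= self.capacity:
                return False                    # shared pool exhausted: cell dropped
            self.queues[out_port].append(cell)
            self.used += 1
            return True

        def dequeue(self, out_port):
            if not self.queues[out_port]:
                return None
            self.used -= 1
            return self.queues[out_port].popleft()

    switch = SharedBufferSwitch(ports=32, capacity_cells=85_000)
    switch.enqueue(0, b"\x00" * 53)
    print(switch.dequeue(0))
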
TR-08-97 [tr-08-97.ps.gz (132 K)]
Rebecca Hwa. 1997. ``Evaluation of Two Connectionist Approaches to Stack Representation.''
This study empirically compares two distributed connectionist learning models trained to represent an arbitrarily deep stack. One is Pollack's Recursive Auto-Associative Memory, a recurrent, back-propagating neural network that uses a hidden intermediate representation. The other is the Exponential Decay Model, a novel framework we propose here, that trains the network to explicitly model the stack as an exponentially decaying entity. We show that although the concept of a stack is learnable for both approaches, neither is able to deliver the arbitrary depth attribute. Ultimately, both suffer from the rapid rate of error propagation inherent in their recursive structures.
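
The Exponential Decay Model itself is a trained network; the hand-coded sketch below conveys only the underlying arithmetic intuition of a stack whose older elements shrink geometrically with depth, and hints at why finite precision limits the usable depth.

    # Hand-coded (not learned) illustration of a decaying-stack encoding: the
    # whole stack is a single number in [0, 1); each push scales the old contents
    # by 1/K, so deeper elements occupy exponentially smaller "amplitude".
    # Finite precision bounds the depth such a scalar encoding can hold.
    K = 10  # stack symbols are the digits 0..9

    def push(state, symbol):
        return (state + symbol) / K      # old contents shrink by a factor of K

    def top(state):
        return int(state * K + 1e-9)     # leading digit (epsilon guards rounding)

    def pop(state):
        return state * K - top(state)    # promote whatever is underneath

    s = 0.0
    for symbol in [3, 1, 4, 1, 5]:
        s = push(s, symbol)

    popped = []
    for _ in range(5):
        popped.append(top(s))
        s = pop(s)
    print(popped)                        # [5, 1, 4, 1, 3]: LIFO order recovered
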
TR-09-97 [tr-09-97.ps.gz (91 K)]
Roni Khardon. 1997. ``Learning Action Strategies for Planning Domains.''
This paper reports on experiments where techniques of supervised machine learning are applied to the problem of planning. The input to the learning algorithm is composed of a description of a planning domain, planning problems in this domain, and solutions for them. The output is an efficient algorithm - a strategy - for solving problems in that domain. We test the strategy on an independent set of planning problems from the same domain, so that success is measured by its ability to solve complete problems. A system, L2Act, has been developed in order to perform these experiments.

We have experimented with the blocks world domain, and the logistics domain, using strategies in the form of a generalization of decision lists, where the rules on the list are existentially quantified first order expressions. The learning algorithm is a variant of Rivest's (1987) algorithm, improved with several techniques that reduce its time complexity. As the experiments demonstrate, generalization is achieved so that large unseen problems can be solved by the learned strategies. The learned strategies are efficient and are shown to find solutions of high quality. We also discuss preliminary experiments with linear threshold algorithms for these problems.
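
The flavor of the learning algorithm can be suggested with a propositional simplification of Rivest's decision-list procedure (the rules learned by L2Act are existentially quantified first-order expressions, which this sketch does not capture): greedily pick a rule consistent with the remaining examples, append it to the list, and discard the examples it covers.

    # Propositional sketch of Rivest-style decision-list learning (L2Act uses
    # existentially quantified first-order rules; this toy version uses
    # single-literal tests). Repeatedly pick a (literal, label) rule that is
    # consistent with all remaining examples and covers at least one, append it,
    # and discard the covered examples.
    from itertools import product

    def learn_decision_list(examples, n_features):
        # examples: list of (feature_tuple, label) with boolean features/labels
        remaining = list(examples)
        rules = []
        while remaining:
            for i, value, label in product(range(n_features), (True, False), (True, False)):
                covered = [(x, y) for x, y in remaining if x[i] == value]
                if covered and all(y == label for _, y in covered):
                    rules.append((i, value, label))
                    remaining = [(x, y) for x, y in remaining if x[i] != value]
                    break
            else:
                raise ValueError("no consistent single-literal rule exists")
        return rules

    def classify(rules, x, default=False):
        for i, value, label in rules:
            if x[i] == value:
                return label
        return default

    # Tiny example: the label is "feature 0 and not feature 1".
    data = [((a, b), a and not b) for a, b in product((True, False), repeat=2)]
    rules = learn_decision_list(data, n_features=2)
    print(rules, [classify(rules, x) == y for x, y in data])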

TR-10-97 [tr-10-97.ps.gz (50 K)]
Roni Khardon. 1997. ``L2Act User Manual.''
This note describes the system L2Act, the options it includes, and how to use it. We assume knowledge of the general ideas behind the system, as well as some details on the implementation described in TR-28-95 and TR-09-97.
TR-11-97 [tr-11-97.ps.gz (217 K)]
Lillian Jane Lee. 1997. ``Similarity-Based Approaches to Natural Language Processing.''
Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential for considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the {\em sparse data problem}. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about similar events. This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity.
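
As a heavily simplified illustration of the similarity measure, the sketch below compares words by the Kullback-Leibler divergence between smoothed distributions over their following words and ranks nearest neighbors; the tiny corpus and add-one smoothing are stand-ins for the much larger corpora and more careful estimators used in the thesis.

    # Toy illustration of similarity-based estimation: compare words by the
    # Kullback-Leibler divergence between their (smoothed) distributions over
    # following words, and rank nearest neighbors by that divergence.
    import math
    from collections import Counter, defaultdict

    corpus = ("the cat drank milk . the dog drank water . "
              "the cat ate fish . the dog ate meat .").split()

    following = defaultdict(Counter)
    for w, nxt in zip(corpus, corpus[1:]):
        following[w][nxt] += 1

    vocab = sorted(set(corpus))

    def smoothed(word):
        counts = following[word]
        total = sum(counts.values()) + len(vocab)          # add-one smoothing
        return {v: (counts[v] + 1) / total for v in vocab}

    def kl(p, q):
        return sum(p[v] * math.log(p[v] / q[v]) for v in vocab)

    target = "cat"
    neighbors = sorted((kl(smoothed(target), smoothed(w)), w)
                       for w in vocab if w != target)
    print(neighbors[:3])   # "dog" should rank as the nearest neighbor of "cat"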

Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing, represents (to our knowledge) the first application of soft clustering to problems in natural language processing. We use this method to cluster words drawn from 44 million words of Associated Press Newswire and 10 million words from Grolier's encyclopedia, and find that language models built from the clusters have substantial predictive power. Our algorithm also extends with no modification to other domains, such as document clustering.

Our second approach is a nearest-neighbor approach: instead of calculating a centroid for each class, we in essence build a cluster around each word. We compare several such nearest-neighbor approaches on a word sense disambiguation task and find that as a whole, their performance is far superior to that of standard methods. In another set of experiments, we show that using estimation techniques based on the nearest-neighbor model enables us to achieve perplexity reductions of more than 20 percent over standard techniques in the prediction of low-frequency events, and statistically significant speech recognition error-rate reduction.

TR-12-97 [tr-12-97.ps.gz (259 K)]
Stephen Lee Manley. 1997. ``An Analysis of Issues Facing World Wide Web Servers.''
The World Wide Web has captured the public's interest like no other computer application or tool. In response, businesses have attempted to capitalize on the Web's popularity. As a result, propaganda, assumption, and unfounded theories have taken the place of facts, scientific analysis, and well-reasoned theories. As with all things, the Web's popularity comes with a price: for the first time the computer industry must satisfy exponentially increasing demand. As the World Wide Web becomes the "World Wide Wait" and the Internet changes from the "Information Superhighway" to a "Giant Traffic Jam," the public demands improvements in Web performance.

The lack of cogent scientific analysis prevents true improvement in Web conditions. Nobody knows the source of the bottlenecks. Some assert that the server must be the problem. Others blame Internet congestion. Still others place the blame on modems or slow Local Area Networks. The Web's massive size and growth have made research difficult, but those same factors make such work indispensable.

This thesis examines issues facing the Web by focusing on traffic patterns on a variety of servers. The thesis presents a method of categorizing different Web site growth patterns. It then disproves the theory that CGI has become an important and varied tool on most Web sites. Most importantly, however, the thesis focuses on the source of latency on the Web. An in-depth examination of the data leads to the conclusion that the server cannot be a primary source of latency on the World Wide Web.

The thesis then details the creation of a new realistic, self-configuring, scaling Web server benchmark. By using a site's Web server logs, the benchmark can create a model of the site's traffic. The model can be reduced by a series of abstractions, and scaled to predict future behavior. Finally, the thesis shows the benchmark models realistic Web server traffic, and can serve as a tool for scientific analysis of developments on the Web, and their effects on the server.
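
The following sketch is not the thesis's benchmark; it only illustrates the first step such a log-driven benchmark performs, reducing a Common Log Format server log to per-document request counts and transfer sizes from which a traffic model can be built. The log line shown is a made-up example.

    # Illustrative first step of a log-driven benchmark: parse Common Log Format
    # entries and summarize per-document request counts and transfer sizes, the
    # raw material for a self-configuring traffic model.
    import re
    from collections import Counter

    CLF = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)[^"]*" (\d{3}) (\d+|-)')

    def summarize(lines):
        requests, bytes_sent = Counter(), Counter()
        for line in lines:
            m = CLF.match(line)
            if not m:
                continue
            url, status, size = m.groups()
            if status == "200" and size != "-":
                requests[url] += 1
                bytes_sent[url] += int(size)
        return requests, bytes_sent

    # Hypothetical log line, just to exercise the parser.
    sample = ['host - - [01/Jan/1997:00:00:00 -0500] "GET /index.html HTTP/1.0" 200 2048']
    requests, bytes_sent = summarize(sample)
    print(requests.most_common(10), sum(bytes_sent.values()))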

TR-13-97 [tr-13-97.ps.gz (113 K)]
Y. Charlie Hu and S. Lennart Johnsson. 1997. ``Data Parallel Performance Optimizations Using Array Aliasing.''
The array aliasing mechanism in the Connection Machine Fortran (CMF) language and run--time system provides a unique way of identifying the memory address spaces local to processors within the global address space of distributed memory architectures, while staying in the data parallel programming paradigm. We show how the array aliasing feature can be used effectively in optimizing communication and computation performance. The constructs we present occur frequently in many scientific and engineering applications, and include various forms of aggregation and array reshaping through array aliasing. The effectiveness of the optimization techniques is demonstrated on an implementation of Anderson's hierarchical $O(N)$ $N$--body method.
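
CMF array aliasing is a language and run-time feature; as a loose analogy only, a NumPy view gives the flavor of reinterpreting one array's storage under another shape, so that per-processor blocks can be addressed directly without moving data.

    # Loose NumPy analogy (not CMF itself): "aliasing" a global array under a
    # different shape via a view, so per-block slices can be addressed and
    # updated without copying any data.
    import numpy as np

    nodes, local_len = 4, 8
    global_array = np.arange(nodes * local_len, dtype=np.float64)

    alias = global_array.reshape(nodes, local_len)   # a view: no data movement
    alias[2] *= 10.0                                 # operate on "node 2's" block

    print(alias.base is global_array)                # True: same storage
    print(global_array[16:24])                       # reflects the in-place update
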
TR-14-97 [tr-14-97.ps.gz (626 K)]
Yu Hu. 1997. ``Efficient Data Parallel Implementations of Highly Irregular Problems.''
This dissertation presents optimization techniques for efficient data parallel formulation/implementation of highly irregular problems, and applies the techniques to $O(N)$ hierarchical $N$--body methods for large--scale $N$--body simulations. It demonstrates that highly irregular scientific and engineering problems such as nonadaptive and adaptive $O(N)$ hierarchical $N$--body methods can be efficiently implemented in high--level data parallel languages such as High Performance Fortran (HPF) on scalable parallel architectures. It also presents an empirical study of the accuracy--cost tradeoffs of $O(N)$ hierarchical $N$--body methods.

This dissertation first develops optimization techniques for efficient data parallel implementation of irregular problems, focusing on minimizing the data movement through careful management of the data distribution and the data references, both between the memories of different nodes, and within the memory hierarchy of each node. For hierarchical $N$--body methods, our optimizations on improving arithmetic efficiencies include recognizing dominating computations as matrix--vector multiplications and aggregating them into multiple--instance matrix--matrix multiplications. Experimental results with an implementation in Connection Machine Fortran of Anderson's hierarchical $N$--body method demonstrate that performance competitive with that of the best message--passing implementations of the same class of methods can be achieved.

The dissertation also presents a general data parallel formulation for highly irregular applications, and applies the formulation to an adaptive hierarchical $N$--body method with highly nonuniform particle distributions. The formulation consists of (1) a method for linearizing irregular data structures, (2) a data parallel implementation (in HPF) of graph partitioning algorithms applied to the linearized data structure, and (3) techniques for expressing irregular communications and nonuniform computations associated with the elements of linearized data structures. Experimental results demonstrate that efficient data parallel (HPF) implementations of highly nonuniform problems are feasible with proper language/compiler/runtime support. Our data parallel $N$--body code provides a much needed ``benchmark'' code for evaluating and improving HPF compilers.

This thesis also develops the first data parallel (HPF) implementation of the geometric partitioning algorithm due to Miller, Teng, Thurston and Vavasis -- one of the only two provably good partitioning schemes. Our data parallel formulation makes extensive use of segmented prefix sums and parallel selections, and provides a data parallel procedure for geometric sampling. Experiments on partitioning particles for load--balance and data interactions as required in hierarchical $N$--body algorithms show that the geometric partitioning algorithm has an efficient data parallel formulation.

Finally, this thesis studies the accuracy--cost tradeoffs of $O(N)$ hierarchical $N$--body methods using our implementation of nonadaptive Anderson's method. The various parameters which control the degree of approximation of the computational elements and the separateness of the interacting computational elements govern both the arithmetic complexity and the accuracy of the methods. A scheme for choosing optimal parameters that give the best running time for a prescribed error requirement is developed. Using this scheme, we find that, for a prescribed error, using a near--field containing only nearest-neighbor boxes together with the hierarchy depth that minimizes the number of arithmetic operations also minimizes the total running time.

TR-15-97 [tr-15-97.ps.gz (247 K)]
Christine Hisayo Nakatani. 1997. ``The Computational Processing of Intonational Prominence: A Functional Prosody Perspective.''

TR-16-97 [tr-16-97.ps.gz (80 K)]
Christopher Small, Narendra Ghosh, Hany Saleeb, Margo Seltzer, and Keith Smith. 1997. ``Does Systems Research Measure Up?.''
We surveyed more than two hundred systems research papers published in the last six years, and found that, in experiment after experiment, systems researchers measure the same things, but in the majority of cases the reported results are not reproducible, comparable, or statistically rigorous. In this paper we present data describing the state of systems experimentation and suggest guidelines for structuring commonly run experiments, so that results from work by different researchers can be compared more easily. We conclude with recommendations on how to improve the rigor of published computer systems research.
TR-17-97 [tr-17-97.ps.gz (54 K)]
Stephen Manley, Network Appliance, Michael Courage, Microsoft Corporation, and Margo Seltzer, Harvard University. 1997. ``A Self-Scaling and Self-Configuring Benchmark for Web Servers.''
World Wide Web clients and servers have become some of the most important applications in our computing base today, and as such, we need realistic and meaningful ways of capturing their performance. Current server benchmarks do not capture the wide variation that we see in servers and are not accurate in their characterization of web traffic. In this paper, we present a self-configuring, scalable benchmark that generates a server benchmark load based on actual server loads. In contrast to other web benchmarks, our benchmark characterizes request latency instead of focusing exclusively on throughput-sensitive metrics. We present our new benchmark, hbench:Web, and demonstrate how it accurately captures the load observed by an actual server. We then go on to show how it can be used to assess how continued growth or changes in the workload will affect future performance. Using existing log histories, we show that these predictions are sufficiently realistic to provide insight into tomorrow's web performance.
TR-18-97 [tr-18-97.ps.gz (35 K)]
Margo I. Seltzer, Yasuhiro Endo, Christopher Small, and Keith A. Smith. 1997. ``Issues in Extensible Operating Systems.''
Operating systems research has traditionally consisted of adding functionality to the operating system or inventing and evaluating new methods for performing functions. Regardless of the research goal, the single constant has been that the size and complexity of operating systems increase over time. As a result, operating systems are usually the single most complex piece of software in a computer system, containing hundreds of thousands, if not millions, of lines of code. Today's operating system research is directed at finding new ways to structure the operating system in order to increase its flexibility, allowing it to adapt to changes in the application set it must support. This paper discusses the issues involved in designing such extensible systems and the array of choices facing the operating system designer. We present a framework for describing extensible operating systems and then relate current operating systems to this framework.
TR-19-97 [tr-19-97.ps.gz (68 K)]
Leslie G. Valiant. 1997. ``Projection Learning.''
A method of combining learning algorithms is described that preserves attribute efficiency. It yields learning algorithms that require a number of examples that is polynomial in the number of relevant variables and logarithmic in the number of irrelevant ones. The algorithms are simple to implement and realizable on networks with a number of nodes linear in the total number of variables. They can be viewed as strict generalizations of Littlestone's Winnow algorithm, and, therefore, appropriate to domains having very large numbers of attributes, but where nonlinear hypotheses are sought.
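
Since projection learning strictly generalizes Winnow, a minimal sketch of the basic Winnow update may help fix ideas; this is the base algorithm only, not the projection-learning construction, and the target concept and parameters below are arbitrary.

    # Minimal Winnow sketch: the base algorithm that projection learning
    # generalizes, not the projection construction itself. Weights start at 1,
    # the threshold is n, and mistakes trigger multiplicative promotion or
    # demotion, which is what yields mistake bounds logarithmic in the number
    # of irrelevant attributes.
    import itertools

    def winnow(examples, n, alpha=2.0):
        w = [1.0] * n
        for x, label in examples:            # x is a 0/1 tuple, label is 0 or 1
            predicted = 1 if sum(w[i] * x[i] for i in range(n)) >= n else 0
            if predicted != label:
                for i in range(n):
                    if x[i]:
                        w[i] = w[i] * alpha if label == 1 else w[i] / alpha
        return w

    # Target concept: x[0] OR x[2], hidden among 8 attributes.
    data = [(x, int(x[0] or x[2])) for x in itertools.product((0, 1), repeat=8)]
    weights = winnow(data * 5, n=8)          # a few passes over the examples
    print([round(v, 2) for v in weights])    # relevant attributes get large weights
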
TR-20-97 [tr-20-97.ps.gz (121 K)]
Ellie Baker and Margo Seltzer. 1997. ``The Mug-Shot Search Problem.''
Mug-shot search is the classic example of the general problem of searching a large facial image database when starting out with only a mental image of the sought-after face. We have implemented a prototype content-based image retrieval system that integrates composite face creation methods with a face-recognition technique (Eigenfaces) so that a user can both create faces and search for them automatically in a database. These two functions are fully integrated so that interim created composites may be used to search the data and interim search results may, likewise, be used to modify an evolving composite.

Although the Eigenface method has been studied extensively for its ability to perform face identification tasks (in which the input to the system is an on-line facial image to identify), little research has been done to determine how effective it is as applied to the mug shot search problem (in which there is no on-line input image at the outset). With our prototype system, we have conducted a pilot user study that looks at the usefulness of eigenfaces as applied to this problem. The study shows that the eigenface method, though helpful, is an imperfect model of human perception of similarity between faces. Using a novel evaluation methodology, we have made progress at identifying specific search strategies that, given an imperfect correlation between the system and human similarity metrics, use whatever correlation does exist to the best advantage. We have also shown that the use of facial composites as query images is advantageous compared to restricting users to database images for their queries.
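
A compact sketch of the Eigenface machinery the system builds on (PCA of the image database via the SVD, projection into face space, nearest-neighbor ranking); random vectors stand in for real mug-shot images, and the component count is arbitrary.

    # Compact sketch of the Eigenface machinery: PCA of the image database via
    # the SVD, projection of a query into "face space", and nearest-neighbor
    # ranking of database images. Random vectors stand in for real images.
    import numpy as np

    rng = np.random.default_rng(1)
    database = rng.standard_normal((100, 32 * 32))   # 100 "images", 32x32 pixels each

    mean_face = database.mean(axis=0)
    centered = database - mean_face
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    eigenfaces = vt[:20]                             # keep 20 principal components

    def project(image):
        return eigenfaces @ (image - mean_face)

    def rank_matches(query_image, k=5):
        q = project(query_image)
        coords = centered @ eigenfaces.T             # database projections
        distances = np.linalg.norm(coords - q, axis=1)
        return np.argsort(distances)[:k]             # indices of the closest faces

    # A "composite" query: a noisy version of database image 7 should rank 7 first.
    query = database[7] + 0.1 * rng.standard_normal(32 * 32)
    print(rank_matches(query))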

TR-21-97 [tr-21-97.ps.gz (235 K)]
Omri Traub, Glenn Holloway, and Michael Smith. 1997. ``Quality and Speed in Linear-scan Register Allocation.''
A linear-scan algorithm directs the global allocation of register candidates to registers based on a simple linear sweep over the program being compiled. This approach to register allocation makes sense for systems, such as those for dynamic compilation, where compilation speed is important. In contrast, most commercial and research optimizing compilers rely on a graph-coloring approach to global register allocation. In this paper, we compare the performance of a linear-scan method against a modern graph-coloring method. We implement both register allocators within the Machine SUIF extension of the Stanford SUIF compiler system. Experimental results show that linear scan is much faster than coloring on benchmarks with large numbers of register candidates. We also describe improvements to the linear-scan approach that do not change its linear character, but allow it to produce code of a quality near to that produced by graph coloring.
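
The basic linear-scan idea (in the style of Poletto and Sarkar's formulation rather than the improved allocator evaluated in this paper) can be sketched compactly: walk live intervals in order of start point, retire intervals that have ended, and when no register is free spill the interval that ends furthest in the future.

    # Sketch of basic linear-scan allocation (Poletto/Sarkar style, not this
    # paper's improved allocator): walk live intervals by start point, expire
    # intervals that have ended, and when no register is free spill whichever
    # active interval ends furthest in the future.
    def linear_scan(intervals, num_regs):
        # intervals: list of (name, start, end); returns name -> register or "spill"
        allocation = {}
        free = list(range(num_regs))
        active = []                                    # (end, name) pairs, kept sorted

        for name, start, end in sorted(intervals, key=lambda iv: iv[1]):
            while active and active[0][0] < start:     # expire finished intervals
                _, old = active.pop(0)
                free.append(allocation[old])
            if free:
                allocation[name] = free.pop()
            else:
                spill_end, spill_name = active[-1]     # furthest-ending active interval
                if spill_end > end:
                    allocation[name] = allocation[spill_name]
                    allocation[spill_name] = "spill"
                    active.pop()
                else:
                    allocation[name] = "spill"
                    continue
            active.append((end, name))
            active.sort()
        return allocation

    live = [("a", 0, 8), ("b", 1, 3), ("c", 2, 9), ("d", 4, 6)]
    print(linear_scan(live, num_regs=2))               # "c" is spilled
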
TR-22-97 [tr-22-97.ps.gz (117 K)]
Barbara J. Grosz and Sarit Kraus. 1997. ``The Evolution of SharedPlans.''

TR-23-97 [tr-23-97.ps.gz (1887 K)]
Kathleen Ryall. 1997. ``Computer-Human Collaboration in the Design of Graphics.''
Delineating the roles of the user and the computer in a system is a central task in user interface design. As interactive applications become more complex, it is increasingly difficult to design interface methods that deliver the full power of an application to users, while enabling them to learn and use the interface to a system effectively. The challenge is finding a balance between user intervention and computer control within the computer-user interface.

In this thesis, we propose a new paradigm for determining this division of labor, one that attempts to exploit the strengths of each collaborator, the human user and the computer system. This collaborative framework encourages the development of semi-automatic systems, through which users can explore a large number of candidate solutions while evaluating and comparing various alternatives. Under the collaborative framework, the problem to be solved is framed as an optimization problem, which is then decomposed into local and global portions. The user is responsible for global aspects of the problem: placing the computer into different areas of the search space, and determining when an acceptable solution has been reached. The computer works at the local level, computing the local minima, displaying results to the user, and providing simple interface mechanisms to facilitate the interaction. Systems employing this approach make use of task-specific information to leverage the actions of users, handling fine-grained details while leaving the high-level aspects, specified through gross interface gestures, to the user.

We present applications of our collaborative paradigm to the design and implementation of semi-automatic systems for three tasks from the domain of graphic design: network diagram layout, parameter specification for computer graphics algorithms and floor plan segmentation. The collaborative paradigm we propose is well-suited for this domain. Systems designed under our framework support an iterative design process, an integral component of graphic design. Furthermore, the collaborative framework for computer-user interface design exploits people's expertise at incorporating aesthetic criteria and semantic information into finding an acceptable solution for a graphic design task, while harnessing a computer's computational power, to enable users to explore a large space of candidate solutions.
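
The division of labor can be caricatured in a few lines: the user supplies global guidance as starting points and judges the ranked results, while the computer performs local descent from each start. The objective and step rule below are arbitrary placeholders, not the layout energies used in the thesis.

    # Schematic toy of the collaborative division of labor: the "user" supplies
    # global guidance as starting points and judges the results; the computer
    # performs local descent from each start. Objective and step size are
    # arbitrary placeholders.
    import math

    def local_descent(f, x, step=0.01, iters=2000):
        for _ in range(iters):
            grad = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6   # numerical gradient
            x -= step * grad
        return x

    def collaborate(f, user_start_points):
        candidates = []
        for x0 in user_start_points:                    # user-chosen regions of the space
            x = local_descent(f, x0)                    # computer refines each locally
            candidates.append((f(x), x))
        return sorted(candidates)                       # user inspects these, best first

    def energy(x):                                      # a bumpy 1-D "design energy"
        return (x - 2.0) ** 2 + math.sin(5.0 * x)

    print(collaborate(energy, user_start_points=[-2.0, 0.0, 2.0, 4.0]))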

TR-24-97 [tr-24-97.ps.gz (382 K)]
Nadia Shalaby. 1997. ``Fast Parallel Orthogonal Transforms.''
Orthogonal transforms are ubiquitous in engineering and applied science applications. Such applications place stringent requirements on a computer system, either in terms of time, which translates into processing power, or problem size, which translates into memory size, or most often both. To achieve such high performance, these applications have been migrating to the realm of parallel computing, which offers significantly more processing power and memory. This trend mandates the development of fast parallel algorithms for orthogonal transforms that are versatile across many parallel platforms, and such algorithms are the subject of this thesis.

We present a unified approach to fast parallel orthogonal transforms by establishing a theoretical formulation for each algorithm, presenting a simple specification to facilitate implementation and analysis, and evaluating performance via a set of predefined arithmetic, memory, communication and load--imbalance complexity metrics. We adopt a bottom-up approach by constructing the thesis in the form of progressively larger modular blocks, each relying on a lower-level block for its formulation, algorithmic specification and performance analysis.

Since communication is the bottleneck for parallel computing, we first establish an algebraic framework for stable parallel permutations, the predominant communication requirement for parallel orthogonal transforms. We present a taxonomy categorizing stable permutations into classes of index--digit, linear, translation, affine and polynomial permutations. For each class, we demonstrate its general behavioral properties and prove permutation locality and other properties for particular examples.

These results are then applied in formulating, specifying and evaluating performance of the direct and load--balanced parallel algorithms for the fast Fourier transform (FFT); we prove the latter to be optimal. We demonstrate the versatility of our complexity metrics by substituting them into the PRAM, LPRAM, BSP and LogP computational models. Consequently, we employ the load--balanced FFT in deriving a novel polynomial--based discrete cosine transform (DCT) and demonstrate its performance advantage over the classical DCT. Finally, the polynomial DCT is used as a building block for a novel parallel fast Legendre transform (FLT), a parallelization of the Driscoll--Healey $O(N \log^2 N)$ method. We show that the algorithm is hierarchically load--balanced by composing its specification and complexity from its modular blocks.
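
One concrete member of the index-digit permutation class discussed above is the familiar bit-reversal reordering used by radix-2 FFTs; a small sketch:

    # A small instance of the index-digit permutation class: bit reversal, the
    # data reordering used by radix-2 FFTs. Each element moves to the index
    # whose binary digits are those of its original index, reversed.
    def bit_reverse_permute(data):
        n = len(data)
        bits = n.bit_length() - 1
        assert 1 << bits == n, "length must be a power of two"
        out = [None] * n
        for i, value in enumerate(data):
            rev = int(format(i, f"0{bits}b")[::-1], 2)   # reverse the index's bits
            out[rev] = value
        return out

    print(bit_reverse_permute(list(range(8))))           # [0, 4, 2, 6, 1, 5, 3, 7]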

TR-25-97 [tr-25-97.ps.gz (511 K)]
Daniel Ellard, James Megquier, Lori Park, and Nina Yuan. 1997. ``The India Protocol - Project Report.''
The goal of this project is to explore the potential of the IDA (Information Dispersal Algorithm) as the basis for an efficient, highly available, fault tolerant, and secure distributed data storage system. To this end, we have developed and implemented protocols that are both significantly more efficient and less complex than traditional replicated services, yet provide many of their benefits.
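
The sketch below is not the India protocol itself; it is a toy of the underlying IDA idea: m data values are encoded into n > m pieces with a Vandermonde matrix over a small prime field, and any m of the pieces reconstruct the data by solving the resulting linear system.

    # Toy illustration of information dispersal (Rabin's IDA idea, not the India
    # protocol): m data values are encoded into n > m pieces via a Vandermonde
    # matrix over a small prime field; any m pieces reconstruct the data by
    # solving the corresponding linear system.
    P = 257                     # prime field large enough to hold byte values

    def encode(data, n):        # data: m ints in [0, P); returns n (index, value) pieces
        m = len(data)
        return [(i, sum(pow(i + 1, j, P) * data[j] for j in range(m)) % P)
                for i in range(n)]

    def decode(pieces, m):      # pieces: any m (index, value) pairs
        # Build the m x m Vandermonde system and solve it by Gauss-Jordan mod P.
        rows = [[pow(i + 1, j, P) for j in range(m)] + [v] for i, v in pieces[:m]]
        for col in range(m):
            pivot = next(r for r in range(col, m) if rows[r][col] != 0)
            rows[col], rows[pivot] = rows[pivot], rows[col]
            inv = pow(rows[col][col], P - 2, P)           # modular inverse
            rows[col] = [x * inv % P for x in rows[col]]
            for r in range(m):
                if r != col and rows[r][col]:
                    factor = rows[r][col]
                    rows[r] = [(a - factor * b) % P for a, b in zip(rows[r], rows[col])]
        return [rows[j][m] for j in range(m)]

    data = [72, 101, 108, 108, 111]                  # "Hello" as byte values
    pieces = encode(data, n=8)                       # 8 pieces; any 5 recover the data
    print(decode([pieces[1], pieces[3], pieces[4], pieces[6], pieces[7]], m=5))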