# Harvard Computer Science Technical Reports for 1994

TR-01-94 [tr-01-94.ps.gz (75 K)]
Roni Khardon and Dan Roth. 1994. Reasoning with Models.''
We develop a model-based approach to reasoning, in which the knowledge base is represented as a set of models (satisfying assignments) rather then a logical formula, and the set of queries is restricted. We show that for every propositional knowledge base (KB) there exists a set of {\em characteristic models} with the property that a query is true in KB if and only if it is satisfied by the models in this set. We fully characterize a set of theories for which the model-based representation is compact and provides efficient reasoning. These include cases where the formula-based representation does not support efficient reasoning. In addition, we consider the model-based approach to {\em abductive reasoning} and show that for any propositional KB, reasoning with its model-based representation yields an abductive explanation in time that is polynomial in its size. Some of our technical results make use of the {\em Monotone Theory}, a new characterization of Boolean functions introduced.

The notion of {\em restricted queries} is inherent to our approach. This is a wide class of queries for which reasoning is very efficient and exact, even when the model-based representation KB provides only an approximate representation of the world''.

Moreover, we show that the theory developed here generalizes the model-based approach to reasoning with Horn theories, and captures even the notion of reasoning with Horn-approximations. Our result characterizes the Horn theories for which the approach suggested in is useful and the phenomena observed there, regarding the relative sizes of the formula-based representation and model-based representation of KB is explained and put in a wider context.

TR-02-94 [tr-02-94.ps.gz (78 K)]
Roni Khardon and Dan Roth. 1994. Learning to Reason.''
We introduce a new framework for the study of reasoning. The Learning (in order) to Reason approach developed here combines the interfaces to the world used by known learning models with the reasoning task and a performance criterion suitable for it. In this framework the intelligent agent is given access to her favorite learning interface, and is also given a grace period in which she can interact with this interface and construct her representation KB of the world $W$. Her reasoning performance is measured only after this period, when she is presented with queries $\alpha$ from some query language, relevant to the world, and has to answer whether $W$ implies $\alpha$.

The approach is meant to overcome the main computational difficulties in the traditional treatment of reasoning which stem from its separation from the world''. First, by allowing the reasoning task to interface the world (as in the known learning models), we avoid the rigid syntactic restriction on the intermediate knowledge representation. Second, we make explicit the dependence of the reasoning performance on the input from the environment. This is possible only because the agent interacts with the world when constructing her knowledge representation.

We show how previous results from learning theory and reasoning fit into this framework and illustrate the usefulness of the Learning to Reason approach by exhibiting new results that are not possible in the traditional setting. First, we give a Learning to Reason algorithm for a class of propositional languages for which there are no efficient reasoning algorithms, when represented as a traditional (formula-based) knowledge base. Second, we exhibit a Learning to Reason algorithm for a class of propositional languages that is not known to be learnable in the traditional sense.

TR-03-94
Z. Johan, K.K. Mathur, S.L. Johnsson, and T.J.R. Hughes. 1994. Mesh Decomposition and Communication Procedures for Finite Element Applications on The Connection Machine CM-5 System.''
TR-08-94 SUPERCEDES TR-03-94.
TR-04-94 [tr-04-94.ps.gz (355 K)]
Z. Johan, K.K. Mathur, S.L. Johnsson, and T.J.R. Hughes. 1994. Data Parallel Finite Element Techniques for Compressible Flow Problems.''
We present a brief description of a finite element solver implemented on the Connection Machine CM-5 system. A more detailed presentation of the issues involved in such an implementation can be found in [1,2].
TR-05-94 [tr-05-94.ps.gz (70 K)]
Alex Fukunaga, Lloyd Hsu, Peter Reiss, Andrew Shuman, Jon Christensen, Joe Marks, and J. Thomas Ngo. 1994. Motion-Synthesis Techniques for 2D Articulated Figures.''
In this paper we extend previous work on automatic motion synthesis for physically realistic 2D articulated figures in three ways. First, we describe an improved motion-synthesis algorithm that runs substantially faster than previously reported algorithms. Second, we present two new techniques for influencing the style of the motions generated by the algorithm. These techniques can be used by an animator to achieve a desired movement style, or they can be used to guarantee variety in the motions synthesized over several runs of the algorithm. Finally, we describe an animation editor that supports the interactive concatenation of existing, automatically generated motion controllers to produce complex, composite trajectories. Taken together, these results suggest how a usable, useful system for articulated-figure motion synthesis might be developed.
TR-06-94 [tr-06-94.ps.gz (1470 K)]
Hadi Partovi, Jon Christensen, Amir Khosrowshahi, Joe Marks, and J. Thomas Ngo. 1994. Motion Synthesis for 3D Articulated Figures and Mass-Spring Models.''
Motion synthesis is the process of automatically generating visually plausible motions that meet goal criteria specified by a human animator. The objects whose motions are synthesized are often animated characters that are modeled as articulated figures or mass-spring lattices. Controller synthesis is a technique for motion synthesis that involves searching in a space of possible controllers to generate appropriate motions. Recently, automatic controller-synthesis techniques for 2D articulated figures have been reported. An open question is whether these techniques can be generalized to work for 3D animated characters. In this paper we report successful automatic controller synthesis for 3D articulated figures and mass-spring models that are subject to nonholonomic constraints. These results show that the 3D motion-synthesis problem can be solved in some challenging cases, though much work on this general topic remains to be done.
TR-07-94 [tr-07-94.ps.gz (620 K)]
Zdenek Johan, Kapil K. Mathur, S. Lennart Johnsson, and Thomas J.R. Hughes. 1994. Parallel implementation of recursive spectral bisection on the Connection Machine CM-5 system.''
The recursive spectral bisection (RSB) algorithm was proposed by Pothen {\em et al.} [1] as the basis for computing small vertex separators for sparse matrices. Simon [2] applied this algorithm to mesh decomposition and showed that spectral bisection compared favorably with other decomposition techniques. Since then, the RSB algorithm has been widely accepted in the scientific community because of its robustness and its consistency in the high-quality partitionings it generates. The major drawback of the RSB algorithm is its high computing cost, as noted in [2], caused by the need for solving a series of eigenvalue problems. It is often stated that an unstructured mesh can be decomposed after it is generated, and the decomposition reused for the different calculations performed on that mesh. However, a new partitioning is to be obtained if adaptive mesh refinement is required. The mesh also has to be re-decomposed if the number of processing nodes available to the user changes between two calculations. In order to avoid the mesh decomposition from becoming a significant computational bottleneck, an efficient data-parallel implementation of the RSB algorithm using the CM Fortran language [3] is developed. In this paper, we present only an abbreviated description of the parallel implementation of the RSB algorithm, followed by two decomposition examples. Details of the implementation can be found in [4].
TR-08-94 [tr-08-94.ps.gz (261 K)]
Zdenek Johan, Kapil K. Mathur, S. Lennart Johnsson, and Thomas J.R. Hughes. 1994. Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System.''
The objective of this paper is to analyze the impact of data mapping strategies on the performance of finite element applications. First, we describe a parallel mesh decomposition algorithm based on recursive spectral bisection used to partition the mesh into element blocks. A simple heuristic algorithm then renumbers the mesh nodes. Large three-dimensional meshes demonstrate the efficiency of those mapping strategies and assess the performance of a finite element program for fluid dynamics.
TR-09-94 [tr-09-94.ps.gz (96 K)]
S. Lennart Johnsson. 1994. Data Motion and High Performance Computing.''
Efficient data motion has been of critical importance in high performance computing almost since the first electronic computers were built. Providing sufficient memory bandwidth to balance the capacity of processors led to memory hierarchies, banked and interleaved memories. With the rapid evolution of MOS technologies, microprocessor and memory designs, it is realistic to build systems with thousands of processors and a sustained performance of a trillion operations per second or more. Such systems require tens of thousands of memory banks, even when locality of reference is exploited. Using conventional technologies, interconnecting several thousand processors with tens of thousands of memory banks can feasibly only be made by some form of sparse interconnection network. Efficient use of locality of reference and network bandwidth is critical. We review these issues in this paper.
TR-10-94 [tr-10-94r.ps.gz (469 K)]
Wheeler Ruml, J. Thomas Ngo, Joe Marks, and Stuart M. Shieber. 1994. Easily Searched Encodings for Number Partitioning.''
Can stochastic search algorithms outperform existing deterministic heuristics for the NP-hard problem Number Partitioning if given a sufficient, but practically realizable amount of time? In a thorough empirical investigation using a straightforward implementation of one such algorithm, simulated annealing, Johnson et al. (1991) concluded tentatively that the answer is "no." In this paper we show that the answer can be "yes" if attention is devoted to the issue of problem representation (encoding). We present results from empirical tests of several encodings of Number Partitioning with problem instances consisting of multiple-precision integers drawn from a uniform probability distribution. With these instances and with an appropriate choice of representation, stochastic and deterministic searches can -- routinely and in a practical amount of time -- find solutions several orders of magnitude better than those constructed by the best heuristic known (Karmarkar and Karp, 1982), which does not employ searching. The choice of encoding is found to be more important than the choice of search technique in determining search efficacy. Three alternative explanations for the relative performance of the encodings are tested experimentally. The best encodings tested are found to contain a high proportion of good solutions; moreover, in those encodings, the solutions are organized into a single "bumpy funnel" centered at a known position in the search space. This is likely to be the only relevant structure in the search space because a blind search performs as well as any other search technique tested when the search space is restricted to the funnel tip. We also show how analogous representations might be designed in a principled manner for other difficult combinatorial optimization problems by applying the principles of parameterized arbitration, parameterized constraint, and parameterized greediness.

Keywords: number partitioning, NP-complete, representation, encoding, empirical comparison, stochastic optimization, parameterized arbitration, parameterized constraint, parameterized greediness.

TR-11-94 [tr-11-94.ps.gz (122 K)]
Stuart M. Shieber, Yves Schabes, and Fernando C. N. Pereira. 1994. Principles and Implementation of Deductive Parsing.''
We present a system for generating parsers based directly on the metaphor of parsing as deduction. Parsing algorithms can be represented directly as deduction systems, and a single deduction engine can interpret such deduction systems so as to implement the corresponding parser. The method generalizes easily to parsers for augmented phrase structure formalisms, such as definite-clause grammars and other logic grammar formalisms, and has been used for rapid prototyping of parsing algorithms for a variety of formalisms including variants of tree-adjoining grammars, categorial grammars, and lexicalized context-free grammars.
TR-12-94 [tr-12-94.ps.gz (183 K)]
Karen Daniels, Zhenyu Li, and Victor Milenkovic. 1994. Multiple Containment Methods.''
We present three different methods for finding solutions to the 2D translation-only {\em containment} problem: find translations for $k$ polygons that place them inside a given polygonal container without overlap. Both the container and the polygons to be placed in it may be non-convex. First, we provide several exact algorithms that improve results for $k=2$ or $k=3$. In particular, we give an algorithm for three convex polygons and a non-convex container with running time in ${\rm O}(m^3nlog mn)$, where $n$ is the number of vertices in the container, and $m$ is the sum of the vertices of the $k$ polygons. This is an improvement of a factor of $n^2$ over previous algorithms. Second, we give an approximation algorithm for $k$ non-convex polygons and a non-convex container based on restriction and subdivision of the configuration space. Third, we develop a MIP (mixed integer programming) model for $k$ non-convex polygons and a non-convex container.
TR-13-94 [tr-13-94.ps.gz (37 K)]
Bryan Mazlish, Stuart Shieber, and Joe Marks. 1994. A Recursive Coalescing Method for Bisecting Graphs.''
We present an extension to a hybrid graph-bisection algorithm developed by Bui et al. that uses vertex coalescing and the Kernighan-Lin variable-depth algorithm to minimize the size of the cut set. In the original heuristic technique, one iteration of vertex coalescing is used to improve the performance of the original Kernighan-Lin algorithm. We show that by performing vertex coalescing recursively, substantially greater improvements can be achieved for standard random graphs of average degree in the range [2.0,5.0].
TR-14-94 [tr-14-94.tar.Z (758 K)]
Rahul Razdan. 1994. PRISC: Programmable Reduced Instruction Set Computers.''
This thesis introduces Programmable Reduced Instruction Set Computers (PRISC) as a new class of general-purpose computers. PRISC use RISC techniques as a base, but in addition to the conventional RISC instruction resources, PRISC offer hardware programmable resources which can be configured based on the needs of a particular application. This thesis presents the architecture, operating system, and programming language compilation techniques which are needed to successfully build PRISC. Performance results are provided for the simplest form of PRISC -- a RISC microprocessor with a set of programmable functional units consisting of only combinational functions. Results for the SPECint92 benchmark suite indicate that an augmented compiler can provide a performance improvement of 22% over the underlying RISC computer with a hardware area investment less than that needed for a 2 kilobyte SRAM. In addition, active manipulation of the source code leads to significantly higher local performance gains (250%-500%) for general abstract data types such as short-set vectors, hash tables, and finite state machines. Results on end-user applications that utilize these data types indicate a performance gain from 32%-213%.
TR-15-94 [tr-15-94.ps.gz (452 K)]
Zhenyu Li. 1994. Compaction Algorithms for Non-Convex Polygons and Their Applications.''
Given a two-dimensional, non-overlapping layout of convex and non-convex polygons, {\em compaction} refers to a simultaneous motion of the polygons that generates a more densely packed layout. In industrial two-dimensional packing applications, compaction can improve the material utilization of already tightly packed layouts. Efficient algorithms for compacting a layout of non-convex polygons are not previously known.

This dissertation offers the first systematic study of compaction of non-convex polygons. We start by formalizing the compaction problem as that of planning a motion that minimizes some linear objective function of the positions. Based on this formalization, we study the complexity of compaction and show it to be PSPACE-hard.

The major contribution of this dissertation is a position-based optimization model that allows us to calculate directly new polygon positions that constitute a locally optimum solution of the objective via linear programming. This model yields the first practically efficient algorithm for translational compaction--compaction in which the polygons can only translate. This compaction algorithm runs in almost real time and improves the material utilization of production quality human-generated layouts from the apparel industry.

Several algorithms are derived directly from the position-based optimization model to solve related problems arising from manual or automatic layout generation. In particular, the model yields an algorithm for separating overlapping polygons using a minimal amount of motion. This separation algorithm together with a database of human-generated markers can automatically generate markers that approach human performance.

Additionally, we provide several extensions to the position-based optimization model. These extensions enables the model to handle small rotations, to offer flexible control of the distances between polygons and to find optimal solution to two-dimensional packing of non-convex polygons.

This dissertation also includes a compaction algorithm based on existing physical simulation approaches. Although our experimental results showed that it is not practical for compacting tightly packed layouts, this algorithm is of interest because it shows that the simulation can speed up significantly if we use geometrical constraints to replace physical constraints. It also reveals the inherent limitations of physical simulation algorithms in compacting tightly packed layouts.

Most of the algorithms presented in this dissertation have been implemented on a SUN ${\rm SparcStation}^{\rm TM}$ and have been included in a software package licensed to a CAD company.

TR-16-94 [tr-16-94.ps.gz (620 K)]
Zdenek Johan, Kapil K. Mathur, S. Lennart Johnsson, and Thomas J.R. Hughes. 1994. Scalability of Finite Element Applications on Distributed-Memory Parallel Computers.''
This paper demonstrates that scalability and competitive efficiency can be achieved for unstructured grid finite element applications on distributed memory machines, such as the Connection Machine CM-5 system. The efficiency of finite element solvers is analyzed through two applications: an implicit computational aerodynamics application and an explicit solid mechanics application. Scalability of mesh decomposition and data mapping strategies are also discussed. Numerical examples that support the claims for problems with an excess of fourteen million variables are presented.
TR-17-94 [tr-17-94.ps.gz (88 K)]
Javed A. Aslam and Scott E. Decatur. 1994. Improved Noise-Tolerant Learning and Generalized Statistical Queries.''
The statistical query learning model can be viewed as a tool for creating (or demonstrating the existence of) noise-tolerant learning algorithms in the PAC model. The complexity of a statistical query algorithm, in conjunction with the complexity of simulating SQ algorithms in the PAC model with noise, determine the complexity of the noise-tolerant PAC algorithms produced. Although roughly optimal upper bounds have been shown for the complexity of statistical query learning, the corresponding noise-tolerant PAC algorithms are not optimal due to inefficient simulations. In this paper we provide both improved simulations and a new variant of the statistical query model in order to overcome these inefficiencies.

We improve the time complexity of the classification noise simulation of statistical query algorithms. Our new simulation has a roughly optimal dependence on the noise rate. We also derive a simpler proof that statistical queries can be simulated in the presence of classification noise. This proof makes fewer assumptions on the queries themselves and therefore allows one to simulate more general types of queries.

We also define a new variant of the statistical query model based on relative error, and we show that this variant is more natural and strictly more powerful than the standard additive error model. We demonstrate efficient PAC simulations for algorithms in this new model and give general upper bounds on both learning with relative error statistical queries and PAC simulation. We show that any statistical query algorithm can be simulated in the PAC model with malicious errors in such a way that the resultant PAC algorithm has a roughly optimal tolerable malicious error rate and sample complexity.

Finally, we generalize the types of queries allowed in the statistical query model. We discuss the advantages of allowing these generalized queries and show that our results on improved simulations also hold for these queries.

TR-18-94 [tr-18-94.ps.gz (29 K)]
Z. Johan, K.K. Mathur, S.L. Johnsson, and T.J.R. Hughes. 1994. Finite Element Techniques for Computational Fluid Dynamics on the Connection Machine CM-5 System.''

TR-19-94 [tr-19-94.ps.gz (66 K)]
S. Lennart Johnsson and Kapil K. Mathur. 1994. Scientific Software Libraries for Scalable Architectures.''
Massively parallel processors introduce new demands on software systems with respect to performance, scalability, robustness and portability. The increased complexity of the memory systems and the increased range of problem sizes for which a given piece of software is used, poses serious challenges to software developers. The Connection Machine Scientific Software Library, CMSSL, uses several novel techniques to meet these challenges. The CMSSL contains routines for managing the data distribution and provides data distribution independent functionality. High performance is achieved through careful scheduling of arithmetic operations and data motion, and through the automatic selection of algorithms at run--time. We discuss some of the techniques used, and provide evidence that CMSSL has reached the goals of performance and scalability for an important set of applications.
TR-20-94 [tr-20-94.ps.gz (36 K)]
Jean-Philippe Brunet, Palle Pedersen, and S. Lennart Johnsson. 1994. Load-Balanced LU and QR Factor and Solve Routines for Scalable Processors with Scalable I/O.''
The concept of block--cyclic order elimination can be applied to out--of--core $LU$ and $QR$ matrix factorizations on distributed memory architectures equipped with a parallel I/O system. This elimination scheme provides load balanced computation in both the factor and solve phases and further optimizes the use of the network bandwidth to perform I/O operations. Stability of LU factorization is enforced by full column pivoting. Performance results are presented for the Connection Machine system CM--5.
TR-21-94 [tr-21-94.ps.gz (85 K)]
Ted Nesson and Lennart Johnsson. 1994. ROMM Routing: A Class of Efficient Minimal Routing Algorithms.''
ROMM is a class of Randomized, Oblivious, Multi--phase, Minimal routing algorithms. ROMM routing offers a potential for improved performance compared to fully randomized algorithms under both light and heavy loads. ROMM routing also offers close to best case performance for many common permutations. These claims are supported by extensive simulations of binary cube networks for a number of routing patterns. We show that $kimes n$ buffers per node suffice to make $k$--phase ROMM routing free from deadlock and livelock on $n$--dimensional binary cubes.
TR-22-94 [tr-22-94.ps.gz (41 K)]
S. Lennart Johnsson. 1994. Issues in High Performance Computer Networks.''

TR-23-94 [tr-23-94.ps.gz (190 K)]
Alex Fukunaga, Jon Christensen, J. Thomas Ngo, and Joe Marks. 1994. A Comparative Study of Search and Optimization Algorithms for the Automatic Control of Physically Realistic 2-D Animated Figures.''
In the Spacetime Constraints paradigm of animation, the animator specifies what a character should do, and the details of the motion are generated automatically by the computer. Ngo and Marks recently proposed a technique of automatic motion synthesis that uses a massively parallel genetic algorithm to search a space of motion controllers that generate physically realistic motions for 2D articulated figures. In this paper, we describe an empirical study of evolutionary computation algorithms and standard function optimization algorithms that were implemented in lieu of the massively parallel GA in order to find a substantially more efficient search algorithm that would be viable on serial workstations. We discovered that simple search algorithms based on the evolutionary programming paradigm were most efficient in searching the space of motion controllers.
TR-24-94 [tr-24-94.ps.gz (143 K)]
Yu Hu and S. Lennart Johnsson. 1994. Implementing O(N) N-body Algorithms Efficiently in Data Parallel Languages (High Performance Fortran).''
O(N) algorithms for N-body simulations enable the simulation of particle systems with up to 100 million particles on current Massively Parallel Processors (MPPs). Our optimization techniques mainly focus on minimizing the data movement through careful management of the data distribution and the data references, both between the memories of different nodes, and within the memory hierarchy of each node. We show how the techniques can be expressed in languages with an array syntax, such as Connection Machine Fortran (CMF). All CMF constructs used, with one exception, are included in High Performance Fortran.

The effectiveness of our techniques is demonstrated on an implementation of Anderson's hierarchical O(N) N-body method for the Connection Machine system CM-5/5E. Of the total execution time, communication accounts for about 10-20% of the total time, with the average efficiency for arithmetic operations being about 40% and the total efficiency (including communication) being about 35%. For the CM-5E a performance in excess of 60 Mflop/s per node (peak 160 Mflop/s per node) has been measured.

TR-25-94 [tr-25-94.ps.gz (294 K)]
Karen E. Lochbaum. 1994. Using Collaborative Plans to Model the Intentional Structure of Discourse .''
An agent's ability to understand an utterance depends upon its ability to relate that utterance to the preceding discourse. The agent must determine whether the utterance begins a new segment of the discourse, completes the current segment, or contributes to it. The intentional structure of the discourse, comprised of discourse segment purposes and their interrelationships, plays a central role in this process (Grosz and Sidner, 1986). In this thesis, we provide a computational model for recognizing intentional structure and utilizing it in discourse processing. The model specifies how an agent's beliefs about the intentions underlying a discourse affects and are affected by its subsequent discourse. We characterize this process for both interpretation and generation and then provide specific algorithms for modeling the interpretation process.

The collaborative planning framework of SharedPlans (Lochbaum, Grosz, and Sidner, 1990; Grosz and Kraus, 1993) provides the basis for our model of intentional structure. Under this model, agents are taken to engage in discourses and segments of discourses for reasons that derive from the mental state requirements of action and collaboration. Each utterance of a discourse is understood in terms of its contribution to the SharedPlans in which the discourse participants are engaged. We demonstrate that this model satisfies the requirements of Grosz and Sidner's (1986) theory of discourse structure and also simplifies and extends previous plan-based approaches to dialogue understanding. The model has been implemented in a system that demonstrates the contextual role of intentional structure in both interpretation and generation.

TR-26-94 [tr-26-94.ps.gz (163 K)]
Yu Hu and S. Lennart Johnsson. 1994. A Data Parallel Implementation of Hierarchical N-body Methods.''
The O(N) hierarchical N-body algorithms and Massively Parallel Processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We describe a data parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10-25%, and the overall efficiency is about 35%. On a CM-5E the overall performance is about 60 Mflop/s per node, independent of the number of nodes.
TR-27-94 [tr-27-94.ps.gz (310 K)]
David Kramer, S. Lennart Johnsson, and Yu Hu. 1994. Local Basic Linear Algebra Subroutines (LBLAS) for the CM--5/5E.''
The Connection Machine Scientific Software Library (CMSSL) is a library of scientific routines designed for distributed memory architectures. The BLAS of the CMSSL have been implemented as a two--level structure to exploit optimizations local to nodes and across nodes. This paper presents the implementation considerations and performance of the Local BLAS, or BLAS local to each node of the system. A wide variety of loop structures and unrollings have been implemented in order to achieve a uniform and high performance, irrespective of the data layout in node memory. The CMSSL is the only existing high--performance library capable of supporting both the data parallel and message passing modes of programming a distributed memory computer. The implications of implementing BLAS on distributed memory computers are considered in this light.
TR-28-94 [tr-28-94.ps.gz (135 K)]
Barbara Grosz, H.T. Kung, Margo Seltzer, Stuart Shieber, and Michael Smith. 1994. Infrastructure for Research towards Ubiquitous Information Systems.''

TR-29-94 [tr-29-94.ps.gz (67 K)]
Lilei Chen. 1994. Automatic Derivation of Parallel and Systolic Programs.''
We present a simple method for developing parallel and systolic programs from data dependence. We derive sequences of parallel computations and communications based on data dependence and communication delays, and minimize the communication delays and processor idle time. The potential applications for this method include supercompiling, automatic development of parallel programs, and systolic array design.
TR-30-94 [tr-30-94.ps.gz (55 K)]
Christopher Small and Margo Seltzer. 1994. VINO: An Integrated Platform for Operating System and Database Research.''
In 1981, Stonebraker wrote:

Operating system services in many existing systems are either too slow or inappropriate. Current DBMSs usually provide their own and make little or no use of those offered by the operating system.

The standard operating system model has changed little since that time, and we believe that, at its core, it is the {\em wrong} model for DBMS and other resource-intensive applications. The standard model is inflexible, uncooperative, and irregular in its treatment of resources.

We describe the design of a new system, the VINO kernel, which addresses the limitations of standard operating systems. It focuses on three key ideas:

- Applications direct policy. - Kernel mechanisms are reusable by applications. - All resources share a common extensible interface.

VINO's power and flexibility make it an ideal platform for the design and implementation of traditional and modern database management systems.

TR-31-94 [tr-31-94.ps.gz (1057 K)]
David Mazieres and Michael D. Smith. 1994. Abstract Execution in a Multi-Tasking Environment.''
Tracing software execution is an important part of understanding system performance. Raw CPU power has been increasing at a rate far greater than memory and I/O bandwidth, with the result that the performance of client/server and I/O-bound applications is not scaling as one might hope. Unfortunately, the behavior of these types of applications is particularly sensitive to the kinds of distortion induced by traditional tracing methods, so that current traces are either incomplete or of questionable accuracy. Abstract execution is a powerful tracing technique which was invented to speed the tracing of single processes and to store trace data more compactly. In this work, abstract execution was extended to trace multi-tasking workloads. The resulting system is more than 5 times faster than other current methods of gathering multi-tasking traces, and can therefore generate traces with far less time distortion.
TR-32-94 [tr-32-94.ps.gz (71 K)]
L.G. Valiant. 1994. Rationality.''

TR-33-94 [tr-33-94.ps.gz (265 K)]
Igor Najfeld and Timothy F. Havel. 1994. Derivatives of the Matrix Exponential and their Computation.''
Matrix exponentials and their derivatives play an important role in the perturbation analysis, control and parameter estimation of linear dynamical systems. The well-known integral representation of the derivative of the matrix exponential $\exp(t{f A})$ in the matrix direction ${f V}$, $\int_0^t{f\exp(({\it t}-au)A)V\exp(au A)}\,{\rm d}au$, enables us to derive a number of new properties of this derivative, along with spectral, series and exact representations. Many of these results extend to arbitrary analytic functions of a matrix argument, for which we have also derived a simple relation between the gradients of their entries and the directional derivatives in the elementary directions. Based on these results, we construct and optimize two new algorithms for computing the directional derivative. We have also developed a new algorithm for computing the matrix exponential, based on a rational representation of the exponential in terms of the hyperbolic function ${f A}{f coth}({f A})$, which is more efficient than direct Pade approximation. Finally, these results are illustrated by an application to a biologically important parameter estimation problem which arises in nuclear magnetic resonance spectroscopy.
TR-34-94 [tr-34-94.ps.gz (140 K)]
Yasuhiro Endo, James Gwertzman, Margo Seltzer, Christopher Small, Keith A. Smith, and Diane Tang. 1994. VINO: The 1994 Fall Harvest.''

TR-35-94 [tr-35-94.ps.gz (54 K)]
Keith Smith and Margo Seltzer. 1994. File Layout and File System Performance.''
Most contemporary implementations of the Berkeley Fast File System optimize file system throughput by allocating logically sequential data to physically contiguous disk blocks. This clustering is effective when there are many contiguous free blocks on the file system. But the repeated creation and deletion of files of varying sizes that occurs over time on active file systems is likely to cause fragmentation of free space, limiting the ability of the file system to allocate data contiguously and therefore degrading performance.

This paper presents empirical data and the analysis of allocation and fragmentation in the SunOS 4.1.3 file system (a derivative of the Berkeley Fast File System). We have collected data from forty-eight file systems on four file servers over a period of ten months. Our data show that small files are more fragmented than large files, with fewer than 35% of the blocks in two block files being allocated optimally, but more than 80% of the blocks in files larger than 256 kilobytes being allocated optimally. Two factors are respon sible for this difference in fragmentation, an uneven distribution of free space within file system cylinder groups and a disk allocation algorithm which frequently allocates the last block of a file discontiguously from the rest of the file.

Performance measurements on replicas of active file systems show that they seldom perform as well as comparable empty file systems but that this performance degradation is rarely more than 10-15%. This decline in performance is directly correlated to the amount of fragmentation in the files used by the benchmark programs. Both file system utilization and the amount of fragmentation in existing files on the file system influence the amount of fragmentation in newly created files. Characteristics of the file system workload also have a significant impact of file system fragmentation and performance, with typical news server workloads causing extreme fragmentation.

TR-36-94 [tr-36-94.ps.gz (64 K)]
Thomas Cheatham, Amr Fahmy, Dan C. Stefanescu, and Leslie G. Valiant. 1994. Bulk Synchronous Parallel Computing -- A Paradigm for Transportable Software.''
A necessary condition for the establishment, on a substantial basis, of a parallel software industry would appear to be the availability of technology for generating transportable software, i.e. architecture independent software which delivers scalable performance for a wide variety of applications on a wide range of multiprocessor computers. This paper describes H-BSP -- a general purpose parallel computing environment for developing transportable algorithms. H-BSP is based on the Bulk Synchronous Parallel Model (BSP), in which a computation involves a number of supersteps, each having several parallel computational threads that synchronize at the end of the superstep. The BSP Model deals explicitly with the notion of communication among computational threads and introduces parameters g and L that quantify the ratio of communication throughput to computation throughput, and the synchronization period, respectively. These two parameters, together with the number of processors and the problem size, are used to quantify the performance and, therefore, the transportability of given classes of algorithms across machines having different values for these parameters. This paper describes the role of unbundled compiler technology in facilitating the development of such a parallel computer environment.