# Harvard Computer Science Technical Reports for 1992

TR-01-92 [tr-01-92.ps.gz (113 K)]
Kapil K. Mathur and S. Lennart Johnsson. 1992. Multiplication of Matrices of Arbitrary Shape on a Data Parallel Computer.''
Some level-2 and level-3 Distributed Basic Linear Algebra Subroutines (DBLAS) that have been implemented on the Connection Machine system CM-200 are described. For matrix-matrix multiplication, both the nonsystolic and the systolic algorithms are outlined. A systolic algorithm that computes the product matrix in-place is described in detail. All algorithms that are presented here are part of the Connection Machine Scientific Software Library, CMSSL. We show that a level-3 DBLAS yields better performance than a level-2 DBLAS. On the Connection Machine system CM-200, blocking yields a performance improvement by a factor of up to three over level-2 DBLAS. For certain matrix shapes the systolic algorithms offer both improved performance and significantly reduced temporary storage requirements compared to the nonsystolic block algorithms. The performance improvement over the blocked nonsystolic algorithms may be as much as a factor of seven, or more than a factor of 20 over the level-2 DBLAS. We show that, in order to minimize the communication time, an algorithm that leaves the largest operand matrix stationary should be chosen for matrix-matrix multiplication. Furthermore, it is shown both analytically and experimentally that the optimum shape of the processor array yields square stationary submatrices in each processor, i.e., the ratio between the length of the axes of the processing array must be the same as the ratio between the corresponding axes of the stationary matrix. The optimum processor array shape may yield a factor of five performance enhancement for the multiplication of square matrices. For rectangular matrices a factor of 30 improvement was observed for an optimum processor array shape compared to a poorly chosen processor array shape.
TR-02-92 [tr-02-92.ps.gz (683 K)]
Zdene k Johan, Thomas J.R. Hughes, Kapil K. Mathur, , and S. Lennart Johnsson. 1992. A Data Parallel Finite Element Method for Computational Fluid Dynamics on the Connection Machine System.''
A finite element method for computational fluid dynamics has been implemented on the Connection Machine systems CM-2 and CM-200. An implicit iterative solution strategy, based on the preconditioned matrix-free GMRES algorithm, is employed. Parallel data structures built on both nodal and elemental sets are used to achieve maximum parallelization. Communication primitives provided through the Connection Machine Scientific Software Library substantially improved the overall performance of the program. Computations of three-dimensional compressible flows using unstructured meshes having close to one million elements, such as a complete airplane, demonstrate that the Connection Machine systems are suitable for these applications. Performance comparisons are also carried out with the vector computers Cray Y- MP and Convex C-1.
TR-03-92
Marios Mavronicolas. 1992. Efficiency of Semi-Synchronous versus Asynchronous Systems: Atomic Shared Memory.''
The {\em $s$-session problem} is studied in {\em asynchronous} and {\em semi-synchronous} shared-memory systems, under a particular shared-memory communication primitive -- $k$-writer, $k$-reader atomic registers, -- where $k$ is a constant reflecting the communication bound in the model. A session is a part of an execution in which each of $n$ processes takes at least one step; an algorithm for the $s$-session problem guarantees the existence of at least $s$ disjoint sessions. The existence of many sessions guarantees a degree of interleaving which is necessary for certain computations. In the asynchronous model, it is assumed that the time between any two consecutive steps of any process is in the interval $[0,1]$; in the semi-synchronous model, the time between any two consecutive steps of any process is in the interval $[c,1]$ for some $c$ such that $0 < c leq 1$, the {\em synchronous} model being the special case where $c=1$. All processes are initially synchronized and take a step at time $0$. Our main result is a tight (within constant factors) lower bound of $1 + \min {lfloor \frac {1} {2c} \rfloor, lfloor log_ {k} (n-1) - 1 \rfloor } (s-2)$ for the time complexity of any semi- synchronous algorithm for the $s$-session problem. This result implies a time separation between semi-synchronous and asynchronous shared memory systems.
TR-04-92 [tr-04-92.ps.gz (115 K)]
Woody Lichtenstein and S. Lennart Johnsson. 1992. Block-Cyclic Dense Linear Algebra.''
Block-cyclic order elimination algorithms for LU and QR factorization and solve routines are described for distributed memory architectures with processing nodes configured as two- dimensional arrays of arbitrary shape. The cyclic order elimination together with a consecutive data allocation yields good load-balance for both the factorization and solution phases for the solution of dense systems of equations by LU and QR decomposition. Blocking may offer a substantial performance enhancement on architectures for which the level-2 or level-3 BLAS are ideal for operations local to a processing node. High rank updates local to a node may have a performance that is a factor of four or more higher than a rank-1 update. We show that in our implementation the \ $O(N^ {2} )$ \ work in the factorization is of the same significance as the \ $O(N^ {3} )$ \ work, even for large matrices, because the \ $O(N^ {2} )$ \ work is poorly load -balanced in the two-dimensional processor array configuration. However, we show that the two-dimensional processor array configuration with consecutive data allocation and block-cyclic order elimination is optimal with respect to communication for a simple, but fairly general communications model. In our Connection Machine system CM-200 implementation, the peak performance for LU factorization is about 9.4 Gflops/s in 64-bit precision and 16 Gflops/s in 32-bit precision. Blocking offers an overall performance enhancement of about a factor of two. For the data motion, use is made of the fact that the nodes along each axis of the two-dimensional array are interconnected as Boolean cubes.
TR-05-92
Marios Mavronicolas and Dan Roth. 1992. Efficient, Strongly Consistent Implementations of Shared Memory.''
We present two distributed organizations of multiprocessor shared memory and develop for them implementations that are shown to satisfy a strong consistency condition, namely {\em linearizability} , achieve improvements in efficiency over previous ones that support even weaker consistency conditions and possess other important, sought after properties that make them practically attractive. It is assumed throughout this paper that processes have clocks that run at the same rate as real time and all messages incur a delay in the range $[d-u,d]$ for some known constants $u$ and $d$, $0 leq u leq d$. The efficiency of an implementation is measured by the {\em worst-case response time} for performing an operation on an object. For the {\em full caching} organization, where each process keeps local copies of all objects, we present the first efficient linearizable implementation of read/write objects. The family of linearizable implementations we present is parameterized in a way that allows to degrade the less frequently employed operation, and is shown to be essentially optimal. For the {\em single ownership} organization, each shared object is owned'' by a single process, which is most likely to access it frequently. We present an implementation, that allows a process to access local information much faster (almost instantaneously) than it can access remote information, while still supporting linearizability. While the cost of the global operations depends on the maximal message delay $d$, the cost of the local operations depends only on the message delay uncertainty $u$. In both implementations, decisions made by individual processes do not make use of any communicated timing information. In particular, timing information is not part of the messages passed by our protocols, and those are of bounded size. These two organizations can be combined in a hierarchical memory structure, which supports linearizability very efficiently; this hybrid structure allows processes to access local and remote information in a transparent manner, while at a lower level of the memory consistency system, different portions of the memory, allocated a priory according to anticipated remote versus local use of the objects, employ the suitable, full caching or single ownership implementation.
TR-06-92
Robert Muller. 1992. Lecture Notes on Domain Theory.''
This report contains a collection of lecture notes for a series of lectures introducing Scott's {\em domain theory} . The basic structure of semantic domains and fixpoint theory are introduced. The lecture notes are not intended to serve as a primary reference but rather as a supplement to a more comprehensive treatment.
TR-07-92 [tr-07-92.ps.gz (72 K)]
Alan Edelman, Steve Heller, and S. Lennart Johnsson. 1992. Index Transformation Algorithms in a Linear Algebra Framework.''
We present a linear algebraic formulation for a class of index transformations such as Gray code encoding and decoding, matrix transpose, bit reversal, vector reversal, shuffles, and other index or dimension permutations. This formulation unifies, simplifies, and can be used to derive algorithms for hypercube multiprocessors. We show how all the widely known properties of Gray codes, and some not so well-known properties as well, can be derived using this framework. Using this framework, we relate hypercube communications algorithms to Gauss-Jordan elimination on a matrix of 0's and l's.
TR-08-92 [tr-08-92.ps.gz (90 K)]
Yves Schabes and Stuart M. Shieber. 1992. An Alternative Conception of Tree-Adjoining Derivation.''
The precise formulation of derivation for tree-adjoining grammars has important ramifications for a wide variety of uses of the formalism, from syntactic analysis to semantic interpretation and statistical language modeling. We argue that the definition of tree-adjoining derivation must be reformulated in order to manifest the proper linguistic dependencies in derivations. The particular proposal is both precisely characterizable, through a compilation to linear indexed grammars, and computationally operational, by virtue of an efficient algorithm for recognition and parsing.
TR-09-92 [tr-09-92.ps.gz (119 K)]
S. Lennart Johnsson and Luis F. Ortiz. 1992. Local Basic Linear Algebra Subroutines (LBLAS) for Distributed Memory Architectures and Languages with Array Syntax.''
We describe a subset of the level-1, level-2, and level-3 BLAS implemented for each node of the Connection Machine system CM-200 and with a set of interfaces consistent with Fortran 90. The implementation performs computations on multiple instances in a single call to a routine. The strides for the different axes are derived from an array descriptor that contains information about the length of the axes, the number of instances and their allocation in the machine. Another novel feature of our implementation of the BLAS in each node is a selection of loop order for rank-1 updates and matrix-matrix multiplication based upon array shapes, strides, and DRAM page faults. The peak efficiencies for the routines are in the range 75% to 90%. The optimization of loop ordering has a success rate exceeding 99.8% for matrices for which the sum of the length of the axes is at most 60. The success rate is even higher for all possible matrix shapes. The performance loss when a nonoptimal choice is made is less than $\sim$15% of peak, and typically less l% of peak. We also show that the performance gain for high rank updates may be as much as a factor of 6 over rank-1 updates.
TR-10-92 [tr-10-92.ps.gz (80 K)]
Alexandros V. Gerbessiotis and Leslie G. Valiant. 1992. Direct Bulk-Synchronous Parallel Algorithms.''
We describe a methodology for constructing parallel algorithms that are transportable among parallel computers having different numbers of processors, different bandwidths of interprocessor communication and different periodicity of global synchronization. We do this for the bulk-synchronous parallel (BSP) model, which abstracts the characteristics of a parallel machine into three numerical parameters $p$, $g$, and $L$, corresponding to processors, bandwidth, and periodicity respectively. The model differentiates memory that is local to a processor from that which is not, but, for the sake of universality, does not differentiate network proximity. The advantages of this model in supporting shared memory or PRAM style programming have been treated elsewhere. Here we emphasize the viability of an alternative direct style of programming where, for the sake of efficiency the programmer retains control of memory allocation. We show that optimality to within a multiplicative factor close to one can be achieved for the problems of Gauss-Jordan elimination and sorting, by transportable algorithms that can be applied for a wide range of values of the parameters $p$, $g$, and $L$. We also give some simulation results for PRAMs on the BSP to identify the level of slack at which corresponding efficiencies can be approached by shared memory simulations, provided the bandwidth parameter $g$ is good enough.
TR-11-92
Lilei Chen. 1992. Deriving Parallel and Systolic Programs from Data Dependence.''
We present an algorithm that statically sequences data computations and communications for parallel and systolic executions. Instead of searching for implicit parallelism in a functional or sequential program, the algorithm looks for sequence requirements imposed by the data dependence and by the communication delays. It achieves its efficiency by analyzing the sequence constraints. The actual sequence of the parallel computations can be decided at the last stage to tailor to the specific parallel machine. In addition, once the processor mapping is decided, the data communication delay is combined with the computation sequence to obtain the final scheduling of the computations and communications. As a result, the parallel implementation fully explores the parallelism in the original program and effectively schedules the computations and minimizes the communication cost by systolic design. Because the algorithm is only based on the data dependence from the original program, it can be applied to a wide variety of program forms, from sequential loop programs with updates, to recursive equation sets. It can detect parallelism in sequential programs, as well as provide efficient implementations for recurrence statements in equation sets.
TR-12-92 [tr-12-92.ps.gz (262 K)]
Cesar Alejandro Galindo-Legaria. 1992. Algebraic Optimization of Outerjoin Queries.''
The purpose of this thesis is to extend database optimization techniques for joins to queries that contain both joins and outerjoins. The benefits of query optimization are thus extended to a number of important applications, such as federated databases, nested queries, and hierarchical views, for which outerjoin is a key component. Our analysis of join/outerjoin queries is done in two parts. First, we investigate the interaction of outerjoin with other relational operators, to find simplification rules and associativity identities. Our approach is comprehensive and includes, as special cases, some outerjoin optimization heuristics that have appeared in the literature. Second, we abstract the notion of feasible evaluation order for binary, join-like operators, considering associativity rules but not specific operator semantics. Combining these two parts, we show that a join/outerjoin query can be evaluated by combining relations in any given order ---just as it is done on join queries, except that now we need to synthesize an operator to use at each step, rather than using always join. The purpose of changing the order of processing of relations is to reduce the size of intermediate results. Our theoretical results are converted into algorithms compatible with the architecture of conventional database optimizers. For optimizers that repeatedly transform feasible strategies, the outerjoin identities we have identified can be applied directly. Those identities are sufficient to obtain all possible orders of processing. For optimizers that generate join programs bottom-up, we give a rule to determine the operators to use at each step.
TR-13-92
Marios Mavronicolas. 1992. Timing-Based, Distributed Computation: Algorithms and Impossibility Results.''
Real distributed systems are subject to timing uncertainties: processes may lack a common notion of real time, or may even have only inexact information about the amount of real time needed for performing primitive computation steps. In this thesis, we embark on a study of the complexity theory of such systems and present combinatorial results that determine the inherent costs of some accomplishable tasks. We first consider {\em continuous-time} models, where processes obtain timing information from continuous- time clocks that run at the same rate as real time, but might not be initially synchronized. Due to an uncertainty in message delay time, absolute process synchronization is known to be impossible for such systems. We develop novel synchronization schemes for such systems and use them for building a distributed, {\em full caching} implementation of shared memory that supports {\em linearizability} . This implementation improves in efficiency over previous ones that support consistency conditions even weaker than linearizability and supports a quantitative degradation of the less frequently occurring operation. We present lower bound results which show that our implementation achieves efficiency close to optimal. We next turn to {\em discrete-time} models, where the time between any two consecutive steps of a process is in the interval $[c,1]$, for some constant $c$ such that $0 leq c leq 1$. We show time separation results between {\em asynchronous} and {\em semi-synchronous} such models, defined by taking $c=0$ and $c > 0$, respectively. Specifically, we use the {\em session problem} to show that the semi-synchronous model, for which the timing uncertainty, $\frac {1} {c}$, is bounded, is strictly more powerful than the asynchronous one under either message-passing or shared-memory interprocess communication. We also present tight lower and upper bounds on the degree of {\em precision} that can be achieved in the semi-synchronous model. Our combinatorial results shed some light on the capabilities and limitations of distributed systems subject to timing uncertainties. In particular, the main argument of this thesis is that the goal of designing distributed algorithms so that their logical correctness is timing-independent, whereas their performance might depend on timing assumptions, will not always be achievable: for some tasks, the only practical solutions might be strongly timing-dependent.
TR-14-92
Marios Mavronicolas. 1992. An Upper and a Lower Bound for Tick Synchronization.''
The {\em tick synchronization problem} is defined and studied in the {\em semi-synchronous} network, where $n$ processes are located at the nodes of a complete graph and communicate by sending messages along links that correspond to its edges. An algorithm for the tick synchronization problem brings each process into a synchronized state in which the process makes an estimate of real time that is close enough to those of other processes already in a synchronized state. It is assumed that the (real) time for message delivery is at most $d$ and the time between any two consecutive steps of any process is in the interval $[c,1]$ for some $c$ such that $0 < c leq 1$. We define the {\em precision} of a tick synchronization algorithm to be the maximum difference between estimates of real time made by different processes in a synchronized state, and propose it as a worst-case performance measure. We show that no such algorithm can guarantee precision less than $lfloor \frac {d-2} {2c} \rfloor$. We also present an algorithm which achieves a precision of $\frac {2(n-1)} {n} (lceil \frac {2d} {c} \rceil + \frac {d} {2} ) + \frac {1-c} {c} d+1$.
TR-15-92
Thomas Raysor Hancock. 1992. The Complexity of Learning Formulas and Decision Trees that have Restricted Reads.''
Many learning problems can be phrased in terms of finding a close approximation to some unknown target formula $f$, based on observing $f$'s value on a sample of points either drawn at random according to some underlying distribution, or perhaps selected by a learner for algorithmic reasons. In this research our goal is to prove theorems about what classes of formulas permit such learning in polynomial time (using the definitions of either Valiant's PAC model or Angluin's exact identification model). In particular we take powerful classes of formulas whose learnability is unknown or provably intractable, and then consider restricted cases where the number of different times a single variable may appear in the formula is limited to a small constant. We prove positive learnability results in several such cases, given either added assumptions on the underlying distribution of random points or the ability of the learner to select some of the sample points. We provide polynomial time learning algorithms for decision trees and monotone disjunctive normal form (DNF) formulas when variables appear at most some arbitrary constant number of times, given that the sample points are chosen uniformly. Over arbitrary distributions, we show algorithms that chose their own sample points, besides using random examples, to closely approximate the same class of decision trees and the class of DNF formulas where variables appear at most twice. For arbitrary formulas, we give a number of algorithms for the read-once case (where variables appear only once) over different bases (the functions computed at formula's nodes). Besides identification algorithms for large classes of boolean read-once formulas, these results include new interpolation algorithms for classes of rational functions, and a membership query algorithm for a new class of neural networks.
TR-16-92 [tr-16-92.ps.gz (120 K)]
S. Lennart Johnsson and Ching-Tien Ho. 1992. Optimal Communication Channel Utilization for Matrix Transposition and Related Permutations on Binary Cubes.''
We present optimal schedules for permutations in which each node sends one or several unique messages to every other node. With concurrent communication on all channels of every node in binary cube networks, the number of element transfers in sequence for \ $K$ \ elements per node is \ ${K over 2} ,$ \ irrespective of the number of nodes over which the data set is distributed. For a succession of \ $s$ \ permutations within disjoint subcubes of \ $d$ \ dimensions each, our schedules yield \ $\min ( {K over 2} + (s - 1)d,(s + 3)d, {K over 2} + 2 d)$ \ exchanges in sequence. The algorithms can be organized to avoid indirect addressing in the internode data exchanges, a property that increases the performance on some architectures. For message passing communication libraries, we present a blocking procedure that minimizes the number of block transfers while preserving the utilization of the communication channels. For schedules with optimal channel utilization, the number of block transfers for a binary \ $d-$cube is \ $d.$ The maximum block size for \ $K$ \ elements per node is \ $lceil {K over 2d} \rceil.$
TR-17-92
Michael Francis Kilian. 1992. Parallel Sets: An Object-Oriented Methodology for Massively Parallel Programming.''
Parallel programming has become the focus of much research in the past decade. As the limits of VLSI technology are tested, it becomes more apparent that parallel processors will be responsible for the next quantum leap in performance. Already parallel programming is responsible for significant advances not so much in the speed of solving problems, but in the size of problems that can be solved. Carefully crafted parallel programs are solving problems magnitudes larger than could be considered for serial machines. Object-oriented programming has also become popular in academia and perhaps even moreso in industry. O-O holds out the promise of being able to efficiently build large systems that are understandable, maintainable, and more robust. The programs targetted by O-O are different than those typically found running on a computer such as the Connection Machine. Parallel programs are often designed for very specific tasks; O-O programs' strengths are that they handle a wide variety of requirements. The thesis proposed here is that an object-oriented model of programming can be developed that is suitable for massively parallel processors. A set of criteria are developed for object- oriented parallel programming models and existing models are evaluated using this criteria. Given these criteria, the thesis presents a new way of thinking of parallel programs that builds upon an object-oriented foundation. A new basic type is added to the object model called Parallel-Set. Parallel sets are rigorously defined and then used to express complex communication between objects. The communication model is then extended to allow communication and synchronization protocols to be developed. The contribution of this work is that a wider range of reliable programs can be designed for use on parallel computers and that these programs will be easier to construct and understand.
TR-18-92 [tr-18-92.ps.gz (134 K)]
S. Lennart Johnsson. 1992. Language and Compiler Issues in Scalable High Performance Scientific Libraries.''
Library functions for scalable architectures must be designed to correctly and efficiently support any distributed data structure that can be created with the supported languages and associated compiler directives. Libraries must be designed also to support concurrency in each function evaluation, as well as the concurrent application of the functions to disjoint array segments, known as {\em multiple-instance} computation. Control over the data distribution is often critical for locality of reference, and so is the control over the interprocessor data motion. Scalability, while preserving efficiency, implies that the data distribution, the data motion, and the scheduling is adapted to the object shapes, the machine configuration, and the size of the objects relative to the machine size. The Connection Machine Scientific Software Library is a scalable library for distributed data structures. The library is designed for languages with an array syntax. It is accessible from all supported languages ($\ast$Lisp, C$\ast$,CM-Fortran, and Paris (PARalel Instruction Set) in combination with Lisp, C, and Fortran 77). Single library calls can manage both concurrent application of a function to disjoint array segments, as well as concurrency in each application of a function. The control of the concurrency is independent of the control constructs provided in the high-level languages. Library functions operate efficiently on any distributed data structure that can be defined in the high-level languages and associated directives. Routines may use their own internal data distribution for efficiency reasons. The algorithm invoked by a call to a library function depends upon the shapes of the objects involved, their sizes and distribution, and upon the machine shape and size.
TR-19-92 [tr-19-92.ps.gz (55 K)]
Stuart M. Shieber. 1992. Lessons from a Restricted Turing Test.''
We report on the recent Loebner prize competition inspired by Turing's test of intelligent behavior. The presentation covers the structure of the competition and the outcome of its first instantiation in an actual event, and an analysis of the purpose, design, and appropriateness of such a competition. We argue that the competition has no clear purpose, that its design prevents any useful outcome, and that such a competition is inappropriate given the current level of technology. We then speculate as to suitable alternatives to the Loebner prize.
TR-20-92 [tr-20-92.ps.gz (150 K)]
Ching-Tien Ho, M.T. Raghunath, and S. Lennart Johnsson. 1992. An Efficient Algorithm for Gray-to-Binary Permutation on Hypercubes.''
Both Gray code and binary code are frequently used in mapping arrays into hypercube architectures. While the former is preferred when communication between adjacent array elements is needed, the latter is preferred for FFT-type communication. When different phases of computations have different types of communication patterns, the need arises to remap the data. We give a nearly optimal algorithm for permuting data from a Gray code mapping to a binary code mapping on a hypercube with communication restricted to one input and one output channel per node at a time. Our algorithm improves over the best previously known algorithm [6] by nearly a factor of two and is optimal to within a factor of \ $n/(n-1)$ \ with respect to data transfer time on an \ $n$-cube. The expected speedup is confirmed by measurements on an Intel iPSC/2 hypercube.
TR-21-92
J. Thomas Ngo and Joe Marks. 1992. Physically Realistic Trajectory Planning in Animation: A Stimulus-Response Approach.''
Trajectory-planning problems arise in subject to physical law other constraints on their motion. Witkin Kass dubbed this class of problems Spacetime Constraints'' (SC) presented results for specific problems involving an articulated f SC problems are typically multimodal discontinuous the number of decision alternatives available at each time step ca constructing even coarse trajectories for subsequent optimization without directive input from the user. Rather than use a time-dom which might be appropriate for local optimization our algorithm uses a stimulus-response model. Locomotive skills a which chooses stimulus-response parameters using a parallel geneti succeeds in finding good novel solutions for a test suite of SC problems involving unbranch
TR-22-92
Peter C. Gordon, Barbara J. Grosz, and Laura A. Gilliom. 1992. Pronouns, Names, and the Centering of Attention in Discourse.''
Centering theory, developed within computational linguistics, provides an account of ways in which patterns of inter-utterance reference can promote the local coherence of discourse. It states that each utterance in a coherent discourse segment contains a single semantic entity -- the backward-looking center -- that provides a link to the previous utterance, and an ordered set of entities -- the forward-looking centers -- that offer potential links to the next utterance. We report five reading-time experiments that test predictions of this theory with respect to the conditions under which it is preferable to realize (refer to) an entity using a pronoun rather than a repeated definite description or name. The experiments show that there is a single backward-looking center that is preferentially realized as a pronoun, and that the backward- looking center is typically realized as the grammatical subject of the utterance. They also provide evidence that there is a set of forward-looking centers that is ranked in terms of prominence and that a key factor in determining prominence, surface-initial position, does not affect determination of the backward-looking center. This provides evidence for the dissociation of the coherence processes of looking backward and looking forward.
TR-23-92 [tr-23-92.ps.gz (216 K)]
Kapil K. Mathur and S. Lennart Johnsson. 1992. Communication Primitives for Unstructured Finite Element Simulations on Data Parallel Architectures.''
Efficient data motion is critical for high performance computing on distributed memory architectures. The value of some techniques for efficient data motion is illustrated by identifying generic communication primitives. Further, the efficiency of these primitives is demonstrated on three different applications using the finite element method for unstructured grids and sparse solvers with different communication requirements. For the applications presented, the techniques advocated reduced the communication times by a factor of between 1.5 - 3.
TR-24-92 [tr-24-92.ps.gz (67 K)]
Leslie G. Valiant. 1992. A Combining Mechanism for Parallel Computers.''
In a multiprocessor computer communication among the components may be based either on a simple router, which delivers messages point-to-point like a mail service, or on a more elaborate combining network that, in return for a greater investment in hardware, can combine messages to the same address prior to delivery. This paper describes a mechanism for recirculating messages in a simple router so that the added functionality of a combining network, for arbitrary access patterns, can be achieved by it with reasonable efficiency. The method brings together the messages with the same destination address in more than one stage, and at a set of components that is determined by a hash function and decreases in number at each stage.
TR-25-92 [tr-25-92.ps.gz (282 K)]
Jon Christensen, Joe Marks, and Stuart Shieber. 1992. Labeling Point Features on Maps and Diagrams.''
(Revised 6/94; includes color figures.) A major factor affecting the clarity of graphical displays that include text labels is the degree to which labels obscure display features (including other labels) as a result of spatial overlap. Point-feature label placement (PFLP) is the problem of placing text labels adjacent to point features on a map or diagram so as to maximize legibility. This problem occurs frequently in the production of many types of informational graphics, though it arises most often in automated cartography. In this paper we present a comprehensive treatment of the PFLP problem, viewed as a type of combinatorial optimization problem. Complexity analysis reveals that the basic PFLP problem and most interesting variants of it are NP-hard. These negative results help inform a survey of previously reported algorithms for PFLP; not surprisingly, all such algorithms either have exponential time complexity or are incomplete. To solve the PFLP problem in practice, then, we must rely on good heuristic methods. We pro pose two new methods, one based on a discrete form of gradient descent, the other on simulated annealing, and report on a series of empirical tests comparing these and the other known algorithms for the problem. Based on this study, the first to be conducted, we identify the best approaches as a function of available computation time.
TR-26-92
L.G. Valiant. 1992. Why BSP Computers?.''

TR-27-92 [tr-27-92.ps.gz (84 K)]
Stuart M. Shieber and Mark Johnson. 1992. Variations on Incremental Interpretation.''

TR-28-92
Yuli Zhou. 1992. A Fixpoint Theory of Nonmonotonic Functions and Its Applications to Logic Programs, Deductive Databases and.''
In this thesis we shall employ denotational (fixpoint) methods to study the computations of rule systems based on first order logic. The resulting theory parallels and further strengthens the fixpoint theory of {\it stratified logic programs} developed by Apt, Blair and Walker, and we shall consider two principal applications of the theory to logic programs and to production rule systems.
TR-29-92 [tr-29-92.ps.gz (90 K)]
S. Lennart Johnsson. 1992. Massively Parallel Computing: Data distribution and communication.''
We discuss some techniques for preserving locality of reference in index spaces when mapped to memory units in a distributed memory architecture. In particular, we discuss the use of multidimensional address spaces instead of linearized address spaces, partitioning of irregular grids, and placement of partitions among nodes. We also discuss a set of communication primitives we have found very useful on the Connection Machine systems in implementing scientific and engineering applications. We briefly review some of the techniques used to fully utilize the bandwidth of the binary cube network of the CM-2 and CM-200, and give some performance data from implementations of commumication primitives.
TR-30-92
Christos Ioannis Kaklamanis. 1992. Optimal Computing on Mesh-Connected Processor Arrays.''
In this thesis, we present and analyze new algorithms for routing, sorting and dynamic searching on mesh-connected arrays of processors; also we present a lower bound concerning embeddings on faulty arrays. In particular, we first consider the problem of permutation routing in two- and three-dimensional mesh-connected processor arrays. We present new on-line and off-line routing algorithms, all of which are optimal to within a small additive term. Then, we show that sorting an input of size $N = n^2$ can be performed by an $n imes n$ mesh-connected processor array in $2n + o(n)$ parallel communication steps and using constant-size queues, with high probability. This result is optimal to within a low order additive term, realizing the obvious diameter lower bound. Our techniques can be applied to higher dimensional meshes as well as torus-connected networks, achieving significantly better bounds than the known results. Furthermore, we investigate the parallel complexity of the backtrack and branch-and-bound search on the mesh-connected array. We present an $Omega(\sqrt {dN} /\sqrt {log N} )$ lower bound for the time needed by a {\em randomized} algorithm to perform backtrack and branch-and-bound search of a tree of depth $d$ on the $\sqrt {N} imes \sqrt {N}$ mesh, even when the depth of the tree is known in advance. For the upper bounds we give {\em deterministic} algorithms that are within a factor of $O(log^ {{3} over {2}} N)$ from our lower bound. Our algorithms do not make any assumption on the shape of the tree to be searched. Our algorithm for branch-and-bound is the first algorithm that performs branch-and-bound search on a sparse network. Both the lower and the upper bounds extend to higher dimension meshes.
TR-31-92
Zena Matilde Ariola. 1992. An Algebraic Approach to the Compilation and Operational Semantics of Functional Languages with I-structures.''
Modern languages are too complex to be given direct operational semantics. For example, the operational semantics of functional languages has traditionally been given by translating them to the $lambda$-calculus extended with constants. Compilers do a similar translation into an intermediate form in the process of generating code for a machine. A compiler then performs optimizations on this intermediate form before generating machine code. In this thesis we show that the intermediate form can actually be the kernel language. In fact, we may translate the kernel language into still lower-level language(s), where more machine oriented or efficiency related concerns can be expressed directly. Furthermore, compiler optimizations may be expressed as source-to-source transformations on the intermediate languages. We introduce two implicitly parallel languages, Kid (Kernel Id) and P-TAC (Parallel Three Address Code), respectively, and describe the compilation process of Id in terms of a translation of Id into Kid, and of Kid into P-TAC\@. In this thesis we do not describe the compilation process below the P-TAC level. However, we show that our compilation process allows the formalization of questions related to the correctness of the optimizations. We also give the operational semantics of Id indirectly by its translation into Kid and a well-defined operational semantics for Kid. Kid and P-TAC are examples of Graph Rewriting Systems (GRSs), which are introduced to capture sharing of computation precisely. Sharing of subexpressions is important both semantically (e.g., to model side-effects) and pragmatically (e.g., to reason about complexity). Our GRSs extend Barendregt's Term Graph Rewriting Systems to include cyclic graphs and cyclic rules. We present a term model for GRSs along the lines of Levy's term model for $lambda$-calculus, and show its application to compiler optimizations. We also show that GRS reduction is a correct implementation of term rewriting.