%R TR-01-87
%D 1987
%T A Very Simple Construction for Atomic Multiwriter Register
%A Ming Li
%A Paul M.B. Vitanyi
%X This paper introduces a new and conceptually very simple algorithm
to implement an atomic {\it n}-reader {\it n}-writer variable
directly from atomic 1-reader 1-writer variables, using bounded
tags. The algorithm is developed top-down from the unbounded tag
method in [VA]. This is the first direct such construction, and
considerably improves the complexity of all known compound
constructions. The algorithm uses new techniques, but its main
virtue is that it is {\it conceptually very simple and easily
proved correct.}
%R TR-02-87
%D 1987
%T Efficient Dispersal of Information for Security, Load Balancing, and
Fault Tolerance.
%A Michael O. Rabin
%X We develop an Information Dispersal Algorithm (IDA) which breaks a
file $F$ of length $L = \mid F \mid$ into $n$ pieces $F_{i}$, $1
\leq i \leq n$, each of length $\mid F_{i} \mid = L/m$, so that
any $m$ pieces suffice for reconstructing $F$. Dispersal and
reconstruction are computationally efficient. The sum of the
lengths $\mid F_{i} \mid$ is $(n/m) \cdot L$. Since $n/m$ can be
chosen close to $1$, the IDA is space efficient. IDA has
numerous applications to secure and reliable storage of
information in computer networks and even on single disks, to
fault-tolerant and efficient transmission of information in
networks, and to communications between processors in parallel
computers. For the latter problem we get provably time-efficient
and highly fault-tolerant routing on the $n$-cube, using just
constant-size buffers.
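The dispersal scheme above admits a compact sketch: encode each block of $m$ symbols against $n$ Vandermonde rows over a prime field, so that any $m$ rows form an invertible matrix. The field GF(257), the evaluation points $a_i = i+1$, and all names below are our illustrative choices, not the report's.

```python
# Sketch of IDA-style dispersal/reconstruction over GF(257) using a
# Vandermonde encoding. Field, evaluation points, and names are ours.

p = 257  # prime > 255, so every byte value is a field element

def inv_mod(x):
    return pow(x, p - 2, p)  # Fermat inverse in GF(p)

def mat_inv(M):
    """Invert an m x m matrix over GF(p) by Gauss-Jordan elimination."""
    m = len(M)
    A = [row[:] + [int(r == c) for c in range(m)] for r, row in enumerate(M)]
    for col in range(m):
        piv = next(r for r in range(col, m) if A[r][col] % p != 0)
        A[col], A[piv] = A[piv], A[col]
        f = inv_mod(A[col][col])
        A[col] = [v * f % p for v in A[col]]
        for r in range(m):
            if r != col and A[r][col]:
                g = A[r][col]
                A[r] = [(A[r][c] - g * A[col][c]) % p for c in range(2 * m)]
    return [row[m:] for row in A]

def disperse(data, n, m):
    """Split `data` (length divisible by m) into n pieces of length L/m;
    piece i holds, for each block, the block evaluated at a_i = i + 1."""
    pieces = []
    for i in range(n):
        a = i + 1
        piece = [sum(data[b + j] * pow(a, j, p) for j in range(m)) % p
                 for b in range(0, len(data), m)]
        pieces.append((i, piece))
    return pieces

def reconstruct(m, subset):
    """Rebuild the data from any m of the (index, piece) pairs."""
    subset = subset[:m]
    V = [[pow(i + 1, j, p) for j in range(m)] for i, _ in subset]
    Vinv = mat_inv(V)
    out = []
    for k in range(len(subset[0][1])):
        col = [piece[k] for _, piece in subset]
        out.extend(sum(Vinv[r][c] * col[c] for c in range(m)) % p
                   for r in range(m))
    return out
```

Total storage across the $n$ pieces is $(n/m) \cdot L$ symbols, matching the space bound stated in the abstract.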
%R TR-03-87
%D 1987
%T Learning in the Presence of Malicious Errors.
%A Michael Kearns
%A Ming Li
%X We study an extension to the Valiant model of machine learning
from examples in which errors may be present in the sample data.
We give strong bounds on the rate of error tolerable when the
errors are of a ``malicious'' nature, and show that it is crucial
to take both positive and negative examples when such errors
exist. Quantitative comparisons between the malicious model of
errors and a model of uniform random noise introduced by Angluin
and Laird are made. A greedy heuristic for a new generalization
of set covering that is of independent interest is given, with
nontrivial applications to learning with errors. We also give a
reduction from learning to set cover.
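For reference, the classical greedy heuristic for plain set covering (pick, at each step, the set covering the most still-uncovered elements) can be sketched as follows; this is the textbook algorithm, not the paper's generalization, and the function name and representation are ours.

```python
# Classical greedy set cover, stated for reference; the paper's
# heuristic addresses a generalization of this problem.

def greedy_set_cover(universe, sets):
    """Repeatedly pick the set covering the most uncovered elements.
    Achieves an H(n) = O(log n) approximation ratio."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("universe is not coverable by the given sets")
        cover.append(best)
        uncovered -= sets[best]
    return cover
```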
%R TR-04-87
%D 1987
%T Two Phase Gossip:\\ Managing Distributed Event Histories.
%A Abdelsalam Heddaya
%A Meichun Hsu
%A William Weihl
%X We describe a distributed protocol that operates on replicated
data objects---of arbitrary abstract types---represented in terms
of the history of object events, rather than in terms of object
values. In general, a site that stores a replica of an object has
only partial knowledge of the object's event history. We call
such a replica an {\it object representative}. The goal of the
protocol is to limit the sizes of the histories by checkpointing
and discarding old events. We treat separately the three
functions of (1) propagating events to sites that do not know
about them, (2) checkpointing, and (3) discarding old events. A
site rolls forward its checkpoint state as far as it {\it knows}
its local version of the object's history to be complete, but it
discards old events only as far as it knows that all other sites
{\it know} their local histories to be complete. Our protocol
propagates events among sites in {\it gossip} messages exchanged
in the background in a {\it two phase} manner. Each site
maintains several timestamp vectors indicating the extent of its
knowledge of other sites' knowledge of events. These timestamp
vectors are included in gossip messages and used to decide the
extent of global-completeness of each site's local version of the
event history. We formally define the correctness criteria for
our protocol in terms of the completeness properties of histories.
%R TR-05-87
%D 1987
%T An Integrated Toolkit for Operating System Security.
%A Michael O. Rabin
%A J.D. Tygar
%R TR-06-87
%D 1987
%T On the influence of a single participant in coin flipping schemes.
%A Benny Chor
%A Mih\`{a}ly Ger\`{e}b-Graus
%X We prove that in a one-round fair coin flipping scheme with {\it
n} participants, either the {\it average} influence of all
participants is at least $3/n - o(1/n)$, or there is at least one
participant whose influence is $\Omega \left( n^{-5/6} \right)$.
%R TR-07-87
%D 1987
%T The Complexity of Parallel Comparison Merging.
%A Mih\`{a}ly Ger\`{e}b-Graus
%A Danny Krizanc
%X We prove a worst-case lower bound of $\Omega(\log \log n)$ for
randomized algorithms merging two sorted lists of length $n$ in
parallel using $n$ processors on Valiant's parallel computation
tree model. We show how to strengthen this result to a lower
bound for the expected time taken by any algorithm on the uniform
distribution. Finally, bounds are given for the average time
required for the problem when the number of processors is less
than and greater than $n$.
%R TR-08-87
%D 1987
%T The Approximation of the Permanent is Still Open\\ - A Flaw in
Broder's Proof -
%A Milena Mihail
%X In [B] Broder claims to have obtained a polynomial time randomized
algorithm for approximating the permanent of dense 0-1 matrices.
In this note we point out a major flaw in the analysis of the
algorithm: the process C2 defined in [B] as a coupling is in fact
{\it not} such. Thus, it remains open whether Broder's algorithm
gives satisfactory estimates.
%R TR-09-87
%D 1987
%T $k+1$ Heads are Better Than $k$ for PDA's
%A Marek Chrobak
%A Ming Li
%X We prove the following conjecture stated by Harrison and Ibarra in
1968 [HI, p.462]: There are languages accepted by $(k+1)$-head
1-way deterministic pushdown automata ($(k+1)$-DPDA) but not by
$k$-head 1-way pushdown automata ($k$-PDA), for every $k$. (Partial
solutions for this conjecture can be found in [M1,M2,C].) On the
assumption that their conjecture holds, [HI] also derived some
other consequences. Now all those consequences become theorems.
For example, the class of languages accepted by $k$-PDA's is not
closed under intersection and complementation. Several other
interesting consequences also follow: CFL $\not\subseteq \cup_{k}$
DPDA($k$) and FA(2) $\not\subseteq \cup_{k}$ DPDA($k$), where
DPDA($k$) $= \{ L \mid L$ is accepted by a $k$-DPDA$\}$ and
FA(2) $= \{ L \mid L$ is accepted by a 2-head FA$\}$. Our proof is
constructive (that is, not based on diagonalization). Previously,
the ``$k$+1 versus $k$ heads'' problem was solved by diagonalization
and translation methods [I2,M2,M3,M4,S] for stronger machines
(2-way, etc.), and by traditional counting arguments [S2,IK,YR,M1]
for weaker machines ($k$-FA's, $k$-head counter machines, etc.).
%R TR-10-87
%D 1987
%T Tape versus Queue and Stacks:\\ The Lower Bounds
%A Ming Li
%A Paul M.B. Vitanyi
%X Several new optimal or nearly optimal lower bounds are derived on
the time needed to simulate queues, stacks (stack = pushdown
store), and tapes by one off-line single-head tape unit with
one-way input, for both the deterministic case and the
nondeterministic case. The techniques rely on algorithmic
information theory (Kolmogorov complexity).
%R TR-11-87
%D 1987
%T Plans for Discourse
%A Barbara J. Grosz
%A Candace L. Sidner
%R TR-12-87
%D 1987
%T Oblivious Secret Computation
%A Donald Beaver
%X Recently, methods for secret, distributed, and fault-tolerant
computation have been developed. Those techniques show how to
``compile'' a function to produce a protocol in which a
distributed system evaluates that function at secret arguments,
revealing the value of the function but not the values of the
arguments, nor the values in intermediate steps of the
computation. We extend these methods in two ways: first, we show
that the {\it function} itself need not be revealed. That is, we
show how to evaluate a function secretly, without revealing any
information about the function itself, other than a bound on the
time and space needed to compute it. This result has fundamental
implications for distributed system security and fault-tolerant
computing. Secondly, we show that revealing the {\it result} of a
secret computation is not necessary, and we give useful
applications in which the values of secretly computed functions
are never revealed. One such application extends the ideas of
oblivious transfer to the distributed environment. Using this
extension, we solve classical, two-party oblivious transfer {\it
without using cryptography} , unlike all previous solutions.
%R TR-13-87
%D 1987
%T Polynomially Sized Boolean Circuits Are\\ Not Learnable
%A Donald Beaver
%X Polynomial-sized boolean circuits provide a powerful and general
framework to describe many naturally occurring concepts. Because
of the wide range of functions which can be described or computed
by boolean circuits, an algorithm to learn arbitrary polynomial-
sized circuits would have broad implications for artificial
intelligence and learning theory. We prove, however, a conjecture
made by Valiant that, if one-way functions exist, then the class
of polynomial-sized boolean circuits is not probably-approximately
learnable. We also discuss how this result might be strengthened
by eliminating the assumption of one-way functions.
%R TR-14-87
%D 1987
%T Oblivious Routing with Limited Buffer Capacity
%A Danny Krizanc
%X The problem of oblivious routing in fixed connection networks with
a limited amount of space available to buffer packets is studied.
We show that for an $n$-processor network with a constant number of
connections and a constant number of buffers any deterministic
pure source-oblivious strategy realizing all partial permutations
requires $\Omega(n)$ time. The consequence of this result for
well-known networks is discussed.
%R TR-01-88
%D 1988
%T Bounded Time-Stamps
%A Amos Israeli
%A Ming Li
%X Time-stamps are labels which a system adds to its data items.
These labels enable the system to keep track of the temporal
precedence relation among its data elements. Traditionally,
time-stamps are used as unbounded numbers, and inevitable overflows
cause a loss of this precedence relation. In this paper we develop
a theory of {\it bounded time-stamps}. Time-stamp systems are
defined and the complexity of their implementation is fully
analyzed. This theory gives a general tool for converting time-
stamp based protocols to bounded protocols. The power of this
theory is demonstrated by a novel, conceptually simple, protocol
for a multi-writer atomic register, as well as by proving, for the
first time, a non-trivial lower bound for such a register.
%R TR-02-88
%D 1988
%T Two Decades of Applied Kolmogorov Complexity
%A Ming Li
%A Paul M.B. Vitanyi
%X This exposition is a survey of elegant and useful applications of
Kolmogorov complexity. We distinguish three areas: I) Application
of the fact that some strings are compressible. This includes a
strong version of G\"{o}del's incompleteness theorem. II) Lower
bound arguments which rest on application of the fact that certain
strings cannot be compressed at all. Applications range from
Turing machines to electronic chips. III) Other issues. For
instance, the foundations of Probability Theory, a priori
probability, and resource-bounded Kolmogorov complexity.
Applications range from NP-completeness to inductive inference in
Artificial Intelligence.
%R TR-03-88
%D 1988
%T Hybrid Beam-Ray Tracing
%A Joe Marks
%A Robert J. Walsh
%A Mark Friedell
%X Ray tracing is the most accurate, but unfortunately also the most
expensive of current rendering techniques. Beam tracing,
suggested by Heckbert and Hanrahan, is a generalization of ray
tracing that exploits area coherence to trace multiple rays in
parallel. We present a new approach to beam tracing derived from
the area-subdivision technique of Warnock's hidden-surface
algorithm. We use this approach to beam tracing as the basis of a
hybrid beam-ray tracing algorithm. The algorithm uses beam tracing
to render large coherent regions of the image, and ray tracing to
render complex regions. A heuristic decision procedure chooses
between beam tracing and ray tracing a given area on the basis of
estimated rendering costs. Experimental results show that our
hybrid algorithm is more efficient than either ray tracing or beam
tracing used alone.
%R TR-04-88
%D 1988
%T Unsafe Operations in B-trees
%A Bin Zhang
%A Meichun Hsu
%X A simple mathematical model for analyzing the dynamics of a
B-tree node is presented. From the solution of the model, it is
shown that the simple technique of allowing a B-tree node to be
slightly less than half full can significantly reduce the rate of
split, merge and borrow operations. We call split, merge, borrow
and balance operations {\it unsafe} operations in this paper. In
a multi-user environment, a lower unsafe-operation rate implies less
blocking and higher throughput, even when tailored concurrency
control algorithms (e.g., that proposed by [Lehman \& Yao]) are
used. A lower unsafe-operation rate also means a longer lifetime for
an optimally initialized B-tree (e.g., a compact B-tree). It is in
general useful to have an analytical model which can predict the
rate of unsafe operations in a dynamic data structure, not only
for comparing the behavior of variations of B-trees, but also for
characterizing workload for performance evaluation of different
concurrency control algorithms for such data structures. The
model presented in this paper represents a starting point in this
direction.
%R TR-05-88
%D 1988
%T The Mean Value Approach to Performance Evaluation of Cautious
Waiting
%A Meichun Hsu
%A Bin Zhang
%X We propose a deadlock-free locking-based concurrency control
algorithm, called {\it cautious waiting}, which allows for a
limited form of waiting, and present an analytical solution to its
performance evaluation based on the mean-value approach. The
proposed algorithm is simple to implement. From the
modeling point of view, we are able to track both the restart rate
and the blocking rate properly, and we show that to solve for
the model we only need to solve for the root of a polynomial.
From the performance point of view, the analytical tools
developed enable us to see that the cautious waiting algorithm
manages to achieve a {\it delicate balance} between restart and
blocking, and is therefore able to perform better (i.e., has higher
throughput) than both the no-waiting and the general-waiting
algorithms. This result is encouraging for other
deadlock-avoidance-oriented locking algorithms.
%R TR-06-88
%D 1988
%T An Architecture-Independent Model for Parallel Programming
%A Gary Wayne Sabot
%X This dissertation describes a fundamentally new way of looking at
parallel programming called the paralation model. The paralation
model consists of a new data structure and a small, irreducible
set of operators. The model can be combined with any base
language to produce a concrete parallel language. One important
goal of the model is ease of use for general problem solving. The
model must provide tools that address the problems of the
programmer. Equally important, languages based upon the model
must be easy to compile, in a transparent and efficient manner,
for a broad range of target computer architectures (for example,
Multiple Instruction Multiple Data (MIMD) or Single Instruction
Multiple Data (SIMD) processors; bus-based, butterfly, or grid
interconnect; and so on). Several compilers based on this model
exist, including one that produces code for the 65,536 processor
Connection Machine. The dissertation includes a short (two pages
long) operational semantics for a paralation language based on
Common Lisp. By executing this code, the interested reader can
experiment with the paralation constructs. The dissertation (as
well as a disk containing a complete implementation of Paralation
Lisp) is available from MIT Press as "The Paralation Model:
Architecture-Independent Parallel Programming".
%R TR-07-88
%D 1988
%T Managing Databases in Distributed Virtual Memory
%A Meichun Hsu
%A Va-On Tam
%R TR-08-88
%D 1988
%T Modeling Performance Impact of Hot Spots
%A Meichun Hsu
%A Bin Zhang
%X An important factor that may affect the performance of the
concurrency control algorithm of a database system is the
skewness of the distribution of accesses to data granules
(i.e., a non-uniform access pattern). In this paper, we examine the
impact of hot spots analytically by employing the mean-value
approach to performance modeling of concurrency control
algorithms. The impact of non-uniform accesses is approached
based on a new principle of {\it data flow balance}. Using data
flow balance we generalize the b-c access pattern (i.e., b\%
of accesses are directed to c\% of data granules) to arbitrary
distributions, and solve the non-uniform access model for
two classes of two phase locking algorithms. We also show that
the database reduction factor employed previously in the
literature to handle non-uniform access is in fact an upper bound.
%R TR-09-88
%D 1988
%T Analytical Preprocessing for Likelihood Ratio Methods
%A Bin Zhang
%X An analytical preprocessing (APP) method for the likelihood ratio
method, an algorithm for finding the derivatives of a performance
function of Discrete Event Dynamical Systems (DEDS), is developed.
The convergence of the integral used in the likelihood ratio method
and the legality of interchanging $\frac{\partial}{\partial\mu}$
with $\int$ are proved for a large class of sample performance
functions.
%R TR-10-88
%D 1988
%T Exemplar-based Learning: Theory and Implementation
%A Steven Salzberg
%X Exemplar-based learning is a theory of inductive learning in which
learning is accomplished by storing objects in Euclidean
$n$-space, $E^n$, as hyper-rectangles. In contrast to the usual
inductive learning paradigm, which learns by replacing symbolic
formulae by more general formulae, the generalization process for
exemplar-based learning modifies hyper-rectangles by growing and
reshaping them in a well-defined fashion. Some advantages and
disadvantages of the theory are described, and the theory is
compared to other inductive learning theories. An implementation
has been tested on several different domains, four of which are
presented here: predicting the recurrence of breast cancer,
classifying iris flowers, predicting stock prices, and predicting
survival times for heart attack patients. The robust learning
behavior and easily understandable representations produced by
the implementation support the claim that exemplar-based learning
offers advantages over other learning models for some classes of
problems.
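A minimal nearest-hyperrectangle learner in this spirit might look as follows; the distance and growth rules here are simplified stand-ins for the well-defined reshaping procedure the report describes, and all names are our own.

```python
# Minimal exemplar-based (nearest-hyperrectangle) learner sketch.
# The distance and growth rules are simplified illustrations.

class Exemplar:
    def __init__(self, point, label):
        self.lo = list(point)  # degenerate rectangle: a single point
        self.hi = list(point)
        self.label = label

    def dist(self, point):
        # zero inside the rectangle, otherwise the sum of per-axis gaps
        return sum(max(l - x, 0, x - h)
                   for x, l, h in zip(point, self.lo, self.hi))

    def grow(self, point):
        # generalize: expand the rectangle to enclose the new point
        self.lo = [min(l, x) for l, x in zip(self.lo, point)]
        self.hi = [max(h, x) for h, x in zip(self.hi, point)]

class NGE:
    def __init__(self):
        self.exemplars = []

    def predict(self, point):
        return min(self.exemplars, key=lambda e: e.dist(point)).label

    def train(self, point, label):
        if self.exemplars:
            nearest = min(self.exemplars, key=lambda e: e.dist(point))
            if nearest.label == label:
                nearest.grow(point)  # generalize an existing rectangle
                return
        self.exemplars.append(Exemplar(point, label))  # memorize new one
```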
%R TR-11-88
%D 1988
%T An Algorithm for Self Diagnosis in Distributed Systems
%A Azer A. Bestavros
%X Based on a powerful diagnostic model, we present a distributed
diagnosis algorithm which makes a distributed system capable of
repairing (replacing) faulty units while operating in a gracefully
degraded, fault-tolerant manner. The algorithm allows the on-line
reentry of repaired (replaced) units, and deals effectively with
synchronization and time-stamping problems. We define an active set
to be a set of fault-free processing elements that agree on a
unified diagnosis. The algorithm we propose enables each healthy
processor to reliably identify the largest possible active set to
which it belongs. If each processor refrains from further dealings with
units not in its active set, then logical reconfiguration can be
achieved. Concurrent with this on-line fault isolation, analyzers
in the system carry out detailed diagnosis to locate the faulty
units, and upon a proper report, the associated controllers can
perform the required repair/replace. Later, recovering units will
be allowed to reenter the system. The algorithm is shown to be
robust in that it guarantees the survival of any possible active
set. In particular, if there are no more than {\it t} faults in
the system, the algorithm guarantees that all fault-free
processors will eventually identify the fault pattern, provided
that the system is {\it t self-diagnosable without repair}.
Furthermore, if the system is {\it t self-diagnosable with
repair}, then the algorithm guarantees that at least one
controller will be able to repair/replace (at least) one faulty
unit.
%R TR-12-88
%D 1988
%T On Coupling and the Approximation of the Permanent
%A Milena Mihail
%X We discuss in detail the limitations of the coupling technique for
a Markov chain on a population with non-trivial combinatorial
structure, namely the perfect and near-perfect matchings of a dense
bipartite graph.
%R TR-13-88
%D 1988
%T Distributed Non-Cryptographic Oblivious Transfer with Constant
Rounds Secret Function Evaluation
%A Donald Beaver
%X We develop tools for distributively and secretly computing the
{\it parity} of a hidden value in a {\it constant number of
rounds} without revealing the secret argument or result, improving
the previous $O(\log n)$ round requirement, and we apply them to
a {\it generalization of Oblivious Transfer} to the distributed
environment. Oblivious Transfer, in which Alice reveals a bit to
Bob with 50-50 probability, but without knowing whether or not Bob
received it, has found diverse applications. In the case where
Alice and Bob are alone, and they cannot trust each other, all
known solutions require extensive cryptography. If, however,
there exist other trustworthy parties around, we show that
oblivious transfer can be performed without any cryptography, even
if nobody knows who the trustworthy parties are. It suffices that
at most a third or a half of the participants are dishonest.
Using previously developed methods for distributed secret
computation, we give a solution requiring $O(\log n)$ rounds of
interaction. We then give novel techniques to reduce the number of
rounds to a constant. These new techniques include computing a
random 0-1 secret, taking the multiplicative inverse of a secret,
and normalizing a secret (without revealing the result) in a
constant expected number of rounds. Such tools allow arbitrary
functions over a polynomially sized domain of arguments to be
evaluated quickly and efficiently. The methods developed for this
paper provide a basis for solutions to other problems in
distributed secret computation.
%R TR-14-88
%D 1988
%T Learning Boolean Formulae or Finite Automata is as Hard as
Factoring
%A Michael Kearns
%A Leslie G. Valiant
%X We prove that the problem of inferring from examples Boolean
formulae or deterministic finite automata is computationally as
difficult as deciding quadratic residuosity, inverting the RSA
encryption function and factoring Blum integers (composite numbers
$p \cdot q$ where $p$ and $q$ are primes both congruent to 3 {\it
mod} 4). These results are for the distribution-free model of
learning. They hold even when the inference task is that of
deriving a probabilistic polynomial time classification algorithm
that predicts the correct value of a random input with probability
at least $\frac{1}{2} + \frac{1}{p(s)}$, where $s$ is the
size of the Boolean formula or deterministic finite automaton, and
$p$ is any polynomial.
%R TR-15-88
%D 1988
%T Merging and Routing on Parallel Models of Computation
%A Daniel David Krizanc
%X The term ``parallel computation'' encompasses a large and diverse
set of problems and theory. In this thesis we use it to refer to
any computation which consists of a large collection of tightly
coupled synchronized processors working together to solve a
terminating computational problem. Models of such computation
capture to varying degrees properties of existing and proposed
parallel machines. Our main concerns here are with how the models
deal with the communication between processors and what effect the
introduction of randomness has on the models. The main results of
the thesis are: (a) a lower bound on the average time required
by a parallel comparison tree (PCT) to merge two lists; (b) a
tight tradeoff between the running time, the probability of
failure, and the number of independent random bits used by an
oblivious routing strategy for a fixed connection network; (c) a
similar tradeoff for the problem of finding the median on the PCT
model; and (d) a tradeoff between the amount of storage required by
the nodes of a network and the deterministic time complexity of a
class of oblivious routing strategies for the network.
%R TR-16-88
%D 1988
%T Parallel Bin Packing Using First-fit and K-delayed Best-fit
Heuristics
%A Azer Bestavros
%A William McKeeman
%X In this paper, we present and contrast simulation results for the
{\it Bin Packing} problem. We used the Connection Machine to
simulate the asymptotic behavior of a variety of packing
heuristics. Several parallel algorithms are considered,
presented and contrasted in the paper. The seemingly serial
nature of the bin packing simulation has prevented previous
experiments from going beyond sizes of several thousand
bins. We show that by adopting a fairly simple data-parallel
algorithm, a speedup by a factor of $N$ over a straightforward
serial implementation is possible. Sizes of up to hundreds of thousands
of bins have been simulated for different parameters and
heuristics.
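For reference, the two basic heuristics being simulated can be sketched in their straightforward serial form; capacity handling and names are our illustrative choices, and this is not the data-parallel Connection Machine algorithm of the paper.

```python
# Serial reference versions of two packing heuristics.
# Items and capacity are integers here for exactness; names are ours.

def first_fit(items, capacity):
    """Place each item in the first open bin with room; else open a bin."""
    bins = []  # current load of each open bin
    for x in items:
        for i, load in enumerate(bins):
            if load + x <= capacity:
                bins[i] = load + x
                break
        else:
            bins.append(x)
    return bins

def best_fit(items, capacity):
    """Place each item in the fullest open bin that still has room."""
    bins = []
    for x in items:
        feasible = [i for i, load in enumerate(bins) if load + x <= capacity]
        if feasible:
            i = max(feasible, key=lambda i: bins[i])
            bins[i] += x
        else:
            bins.append(x)
    return bins
```

The $K$-delayed best-fit variant named in the title is not sketched here.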
%R TR-17-88
%D 1988
%T Ray Tracing for Massively-Parallel Machines
%A Robert J. Walsh
%X Proper data-type modeling for ray tracing on massively-parallel
machines is presented as a basis for parallel graphics algorithm
construction. A spatial enumeration algorithm is developed,
discussed, modeled, and evaluated. The spatial enumeration
evaluation is used to motivate an octree algorithm appropriate for
parallel machines. These algorithms are suitable for any graphics
problem which can be solved using rays, including radiosity.
While the focus is computer graphics, general-purpose, massively-
parallel machine architecture is discussed sufficiently to support
the analysis and to generate some ideas for machine design.
%R TR-18-88
%D 1988
%T A New Scan-Line Algorithm for Massively-Parallel Machines
%A Robert J. Walsh
%X This paper describes how to rapidly generate a synthetic image
using a new rendering algorithm for general-purpose, massively-
parallel architectures. The presentation shows how algorithm
design decisions were made and how to customize the algorithm to a
particular architecture. Alternative algorithms are analyzed
qualitatively and quantitatively to demonstrate the superiority of
the new algorithm.
%R TR-19-88
%D 1988
%T Secure Multiparty Protocols Tolerating Half Faulty Processors
%A Donald Beaver
%X We show how any function of $n$ inputs can be evaluated by a
complete network of $n$ processors, revealing no information other
than the result of the function, and tolerating up to $t$
maliciously faulty parties for $2t < n$. We demonstrate a
resilient method to multiply secret values without using
cryptography. The crux of our method is a new, non-cryptographic
zero-knowledge technique by which a single party can secretly
share values $a_{1}, \cdots, a_{m}$ along with another secret $B
= P(a_{1}, \cdots, a_{m})$, where $P$ is an arbitrary function;
and by which the party can prove to all other parties that $B =
P(a_{1}, \cdots, a_{m})$, without revealing $B$ or any other
information. Using this technique we give a protocol for
multiparty private computation that improves the bound established
by previous results, which required $3t < n$. Our protocols allow
an exponentially small chance of error, but are provably optimal
in their resilience against Byzantine faults.
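Protocols of this kind are typically built on polynomial (Shamir-style) secret sharing, where a random degree-$t$ polynomial hides the secret in its constant term and any $t+1$ shares reconstruct it. The sketch below shows only that substrate, with an illustrative prime and names of our choosing; it does not implement the paper's multiplication or zero-knowledge machinery.

```python
# Sketch of the polynomial secret-sharing substrate such protocols
# build on (Shamir-style). Prime P and all names are illustrative.
import random

P = 2**31 - 1  # a Mersenne prime; any prime larger than the secrets works

def share(secret, t, n, rng=random):
    """Hide `secret` as the constant term of a random degree-t polynomial
    over GF(P); share i is the evaluation at x = i."""
    coeffs = [secret % P] + [rng.randrange(P) for _ in range(t)]
    return [(x, sum(c * pow(x, j, P) for j, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Recover the secret from any t+1 shares by Lagrange interpolation
    at x = 0."""
    s = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        s = (s + yi * num * pow(den, P - 2, P)) % P
    return s
```

Addition of secrets is share-wise (each party adds its two shares locally); the hard part, which the abstract addresses, is multiplying secrets resiliently, since multiplication raises the polynomial degree.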
%R TR-20-88
%D 1988
%T Managing Event-based Replication for Abstract Data Types in
Distributed Systems
%A Abdelsalam Abdelhamid Heddaya
%X Data replication enhances the availability of data in distributed
systems. This thesis deals with the management of a particular
representation of replicated data objects that belong to Abstract
Data Types (ADT). Traditionally, replicated data objects have been
stored in terms of their {\em states}, or {\em values}. In
this thesis, I argue for the viability of state transition {\em
histories}, or {\em logs}, as a more suitable storage
representation for abstract data types in distributed computing
environments. We present two main contributions: a new protocol
for reducing message and storage requirements of histories, and a
novel reconfiguration and recovery method. In the first
protocol, we introduce the notion of {\em two phase gossip} as
the primary mechanism for managing distributed replicated event
histories. We focus our second protocol for reconfiguration and
recovery on enhancing the availability of distributed objects in
the face of sequences of failures. Additionally, our
reconfiguration protocol supports system administration functions
related to the storage of distributed objects. In combination, the
two protocols that we propose demonstrate the viability and
desirability of the distributed representation of an ADT object as
a history of the state transitions that the data object undergoes,
rather than as the value or the sequence of values that it
assumes.
%R TR-01-89
%D 1989
%T A Parallel Algorithm for Eliminating Cycles in Undirected Graphs
%A Philip Klein
%A Clifford Stein
%X We give an efficient parallel algorithm for finding a maximal set
of edge-disjoint cycles in an undirected graph. The algorithm can
be generalized to handle a weighted version of the problem.
%R TR-02-89
%D 1989
%T On the time-space complexity of reachability queries for
preprocessed graphs
%A Lisa Hellerstein
%A Philip Klein
%A Robert Wilber
%X How much can preprocessing help in solving graph problems? In
this paper, we consider the problem of reachability in a directed
bipartite graph, and propose a model for evaluating the usefulness
of preprocessing in solving this problem. We give tight bounds
for restricted versions of the model that suggest that
preprocessing is of limited utility.
%R TR-03-89
%D 1989
%T On the Magnification of 0-1 Polytopes
%A Milena Mihail
%A Umesh Vazirani
%R TR-04-89
%D 1989
%T Conductance and Convergence of Markov Chains
%A Milena Mihail
%A Umesh Vazirani
%X Let $P$ be an irreducible and strongly aperiodic (i.e., $p_{ii}
\geq \frac{1}{2} \; \forall i$) stochastic matrix. We obtain
non-asymptotic bounds for the convergence rate of a Markov chain with
transition matrix $P$ in terms of the {\it conductance} of $P$.
These results have been so far obtained only for time-reversible
Markov chains via partially linear algebraic arguments. Our
proofs eliminate the linear algebra and therefore naturally extend
to general Markov chains. The key new idea is to view the action
of a strongly aperiodic stochastic matrix as a weighted averaging
along the edges of the {\it underlying graph of} $P$. Our
results suggest that the conductance (rather than the second
largest eigenvalue) best quantifies the rate of convergence of
strongly aperiodic Markov chains.
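In the standard notation (our paraphrase; the report's own definitions may differ in constants), the conductance of $P$ with stationary distribution $\pi$ is

```latex
% Conductance: worst-case ratio of the stationary probability flow
% leaving a set S to the stationary weight of S (standard definition).
\Phi(P) \;=\; \min_{S :\, 0 < \pi(S) \leq 1/2}
    \frac{\sum_{i \in S,\; j \notin S} \pi_{i}\, p_{ij}}{\pi(S)}
```

and the non-asymptotic bounds in question are of the form: after $t$ steps the deviation from $\pi$ decays like $(1 - \Phi(P)^{2}/2)^{t}$.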
%R TR-05-89
%D 1989
%T Transaction Synchronization in Distributed Shared Virtual Memory
Systems
%A Meichun Hsu
%A Va-On Tam
%X Distributed shared virtual memory (DSVM) is an abstraction which
integrates the memory space of different machines in a local area
network environment into a single logical entity. The algorithm
responsible for maintaining this {\it virtually} shared image is
called the {\it memory coherence algorithm}. In this paper, we
study the interplay between memory coherence and {\it process
synchronization}. In particular, we devise two-phase
locking-based algorithms in a distributed system under two scenarios:
{\it with} and {\it without} an underlying memory coherence
system. We compare the performance of the two algorithms using
simulation, and argue that significant performance gain can
potentially result from bypassing memory coherence and supporting
process synchronization directly on distributed memory. We also
study the role of the {\it optimistic} algorithms in the context
of DSVM, and show that some optimistic policy appears promising
under the scenarios studied.
%R TR-06-89
%D 1989
%T A VLSI Chip for the Real-time Information Dispersal and Retrieval
for Security and Fault-Tolerance
%A Azer Bestavros
%X In this paper, we describe SETH, a hardwired implementation of the
recently proposed ``Information Dispersal Algorithm'' (IDA). SETH
allows the real-time dispersal of information into different
pieces as well as the retrieval of the original information from
the available pieces. SETH accepts a stream of data and a set of
``keys'' so as to produce the required streams of dispersed data
to be stored on (or communicated to) the different sinks. The
chip can also accept the streams of data from the different
sinks, along with the necessary controls and keys, so as to
reconstruct the original information. We begin this paper by
introducing the Information Dispersal Algorithm and giving an
overview of SETH's operation. The different functions are
described, and system block diagrams at varying levels of detail
are presented. Next, we present an implementation of SETH in
Scalable CMOS technology, fabricated using the MOSIS
3-micron process. We conclude the paper with potential
applications and extensions of SETH. In particular, we emphasize
the promise of the Information Dispersal Algorithm in the design
of I/O subsystems, Redundant Arrays of Inexpensive Disks (RAID),
and reliable communication and routing in
distributed/parallel systems. SETH demonstrates that using IDA in
these applications is feasible.
%R TR-07-89
%D 1989
%T General Purpose Parallel Architectures
%A L.G. Valiant
%X The possibilities for efficient general purpose parallel computers
are examined. First some network models are reviewed. It is then
shown how these networks can efficiently implement some basic
message routing functions. Finally, various high level models of
parallel computation are described and it is shown that the above
routing functions are sufficient to implement them efficiently.
%R TR-08-89
%D 1989
%T Bulk-Synchronous Parallel Computers
%A L.G. Valiant
%X We attribute the success of the von Neumann model of sequential
computation to the fact that it is an efficient bridge between
software and hardware. On the one hand high level languages can be
efficiently compiled on to this model. On the other, it can be
efficiently implemented in hardware in current technology. We
argue that an analogous bridge between software and hardware is
required for parallel computation if that is to become as widely
used. We introduce the bulk-synchronous parallel (BSP) model as a
candidate for this role. We justify this suggestion by giving a
number of results that quantify its efficiency both in
implementing some high level language features, as well as in
being implemented in hardware.
%R TR-09-89
%D 1989
%T Scheduling Initialization Equations for Parallel Execution
%A Lilei Chen
%R TR-10-89
%D 1989
%T Hiding Information from Several Oracles
%A Donald Beaver
%A Joan Feigenbaum
%X Abadi, Feigenbaum, and Kilian have considered {\it computations
with encrypted data} [AFK]. Let $f$ be a function that is not
provably computable in randomized polynomial time; randomized
polynomial-time machine A wants to query an oracle B for $f$ to
obtain $f(x)$, without telling B exactly what $x$ is. Several
well-known random-self-reducible functions, such as discrete
logarithm and quadratic residuosity, are {\it encryptable}
in this sense; that is, A can query B about an instance while
hiding some significant information about the instance. It is
shown in [AFK] that, if $f$ is an NP-hard function, A cannot query
B while keeping secret all but the size of the instance, assuming
that the polynomial hierarchy does not collapse. This negative
result holds even if the oracle B has ``infinite'' computational
power. Here we show that A {\it can} query $n$ oracles B$_1$,
$\ldots$, B$_n$, where $n=|x|$, and obtain $f(x)$ while hiding all
but $n$ from each B$_i$, {\it for any boolean function} $f$. This
answers a question due to Rivest that was left open in [AFK]. Our
proof adapts techniques developed by Ben-Or, Goldwasser, Wigderson
and Chaum, Cr\'epeau, Damg\aa rd for using Shamir's {\it secret-
sharing} scheme to hide information about inputs to distributed
computations [BGW], [CCD], [S].
%R TR-11-89
%D 1989
%T Perfect Privacy for Two-Party Protocols
%A Donald Beaver
%X We examine the problem of computing functions in a distributed
manner based on private inputs, revealing the value of a function
without revealing any additional information about the inputs. A
function $f(x_1,\dots,x_n)$ is $t$-private if there is a protocol
whereby $n$ parties, each holding an input value, can compute $f,$
and no subset of $t$ or fewer parties gains any information other
than what is computable from $f(x_1,\dots,x_n).$ The class of $t$
-private {\em boolean} functions for $t \geq \lceil \frac{n}{2}
\rceil$ was described in [CK89]. We give a characterization of 1-
private functions for two parties, without the restriction to the
boolean case. We also examine a restricted form of private
computation in the $n$-party case, and show that addition is the
only privately computable function in that model.
%R TR-12-89
%D 1989
%T The Input Output Real-Time Automaton: A model for real-time
parallel computation
%A Azer A. Bestavros
%X In this paper, we propose a unified framework for embedded real-
time systems in which it would be possible to see the relation
between issues of specification, implementation, correctness, and
performance. The framework we suggest is based on the IORTA
{\it (Input-Output Real-Time Automata)} model, which is an
extension of the previously introduced IOA model. An IORTA is an
abstraction that encapsulates a system task. An embedded system
is viewed as a set of interacting IORTAs. IORTAs communicate with
each other and with the external environment using {\it signals}.
A signal carries a sequence of {\it events}, where an event
represents an instantiation of an {\it action} at a specific
point in time. Actions can be generated by either the environment
or the IORTAs. Each IORTA has a {\it state}. The state of an IORTA
is observable and can only be changed by local {\it computations}.
Computations are triggered by actions and have to be scheduled to
meet specific timing constraints. IORTAs can be {\it composed}
together to form higher level IORTAs. A specification of an IORTA
is a description of its behavior (i.e., how it reacts to stimuli
from the environment). An IORTA is said to {\it implement}
another IORTA if it is impossible to differentiate between their
external behaviors. This is the primary tool that is used to
verify that an implementation meets the required specification.
%R TR-13-89
%D 1989
%T The Computational Complexity of Machine Learning
%A Michael J. Kearns
%X This thesis is a study of the computational complexity of machine
learning from examples in the distribution-free model introduced
by L.G. Valiant. In the distribution-free model, a learning
algorithm receives positive and negative examples of an unknown
target set (or {\it concept}) that is chosen from some known
class of sets ({\it concept class}). These examples are
generated randomly according to a fixed but unknown probability
distribution representing Nature, and the goal of the learning
algorithm is to infer a hypothesis concept that closely
approximates the target concept with respect to the unknown
distribution. This thesis is concerned with proving theorems
about learning in this formal mathematical model. We are
interested in the phenomenon of {\it efficient} learning in the
distribution-free model, in the standard polynomial-time sense.
Our results include general tools for determining the polynomial-
time learnability of a concept class, an extensive study of
efficient learning when errors are present in the examples, and
lower bounds on the number of examples required for learning in
our model. A centerpiece of the thesis is a series of results
demonstrating the computational difficulty of learning a number of
well-studied concept classes. These results are obtained by
reducing some apparently hard number-theoretic problems from
cryptography to the learning problems. The hard-to-learn concept
classes include the sets represented by Boolean formulae,
deterministic finite automata and a simplified form of neural
networks. We also give algorithms for learning powerful concept
classes under the uniform distribution, and give equivalences
between natural models of efficient learnability. This thesis also
includes detailed definitions and motivation for the distribution-
free model, a chapter discussing past research in this model and
related models, and a short list of important open problems.
%R TR-14-89
%D 1989
%T Learning with Nested Generalized Exemplars
%A Steven Lloyd Salzberg
%X This thesis presents a theory of learning called nested
generalized exemplar theory (NGE), in which learning is
accomplished by storing objects in Euclidean $n$-space, $E^n$, as
hyper-rectangles. The hyper-rectangles may be nested inside one
another to arbitrary depth. In contrast to most generalization
processes, which replace symbolic formulae by more general
formulae, the generalization process for NGE learning modifies
hyper-rectangles by growing and reshaping them in a well-defined
fashion. The axes of these hyper-rectangles are defined by the
variables measured for each example. Each variable can have any
range on the real line; thus the theory is not restricted to
symbolic or binary values. The basis of this theory is a
psychological model called exemplar-based learning, in which
examples are stored strictly as points in $E^n$. This thesis
describes some advantages and disadvantages of NGE theory,
positions it as a form of exemplar-based learning, and compares it
to other inductive learning theories. An implementation has been
tested on several different domains, four of which are presented
in this thesis: predicting the recurrence of breast cancer,
classifying iris flowers, predicting survival times for heart
attack patients, and a discrete event simulation of a prediction
task. The results in these domains are at least as good as, and
in some cases significantly better than, other learning algorithms
applied to the same data. Exemplar-based learning is emerging as a
new direction for machine learning research. The main
contribution of this thesis is to show how an exemplar-based
theory, using nested generalizations to deal with exceptions, can
be used to create very compact representations with excellent
modelling capability.
%R TR-15-89
%D 1989
%T Finite-State Analysis of Asynchronous Circuits with Bounded
Temporal Uncertainty
%A Harry R. Lewis
%R TR-16-89
%D 1989
%T Flux Tracing: A Flexible Infrastructure for Global Shading
%A Jon Christensen
%A Joe Marks
%A Robert Walsh
%A Mark Friedell
%X Flux tracing is a flexible, efficient, and easily implemented
mechanism for determining scene intervisibility. Flux tracing can
be combined with a variety of reflection models in several ways,
yielding many different global-shading techniques. Several shading
techniques based on flux tracing are illustrated. All provide
intensity gradients and shadows with penumbras resulting from area
light sources at finite distances. Some provide specular
reflection and color bleeding, and some are extremely efficient.
Flux tracing is both an expedient means of constructing an
efficient global shader and a flexible tool for experimental
development of global shading algorithms.
%R TR-17-89
%D 1989
%T Efficient Use of Image and Intervisibility Coherence in Rendering
and Radiosity Calculations
%A Mark Friedell
%A Joe Marks
%A Robert Walsh
%A Jon Christensen
%X Rendering algorithms based on image-area sampling were first
proposed almost 20 years ago. In attempting to solve the hidden-
surface problem from many pixels simultaneously, area-sampling
algorithms are the most aggressive attempts to exploit image
coherence. Although area-sampling algorithms are intuitively
appealing, they usually do not perform well. When adequate image
coherence is present, however, they can perform extremely well for
some image regions. We present in this paper two new, hybrid
rendering algorithms that combine area sampling with point-
sampling techniques. In the presence of significant image
coherence, these algorithms are considerably faster than any
generally applicable alternative algorithms; for no image are they
perceptibly slower. We also show how similar hybrid techniques can
exploit intervisibility coherence to efficiently determine form
factors for radiosity calculations.
%R TR-18-89
%D 1989
%T The Role of User Models in System Design
%A Lisa Rubin Neal
%X The inadequacy of system designers' models of users has led to
difficulties in the use of systems and even to their rejection. We
formulate a multi-dimensional user model, of which two dimensions
are related to learning and decision-making styles, and additional
dimensions are related to the plurality of expertise involved in
the use of a system. The incorporation of a user model
characterizing relevant cognitive variables allows the tailoring
of a system to the needs and abilities of its users. We use a
number of computer games as rich tasks which allow us to derive
information about users. We present the results of our research
examining computer games and discuss the implications of this
research for system design.
%R TR-19-89
%D 1989
%T Fast Fault-Tolerant Parallel Communication with Low Congestion and
On-Line Maintenance Using Information Dispersal
%A Yuh-Dauh Lyuu
%X The space-efficient Information Dispersal Algorithm (IDA) is
applied to communication in parallel computers to achieve fast
communication, low congestion, and fault tolerance on various
networks. All schemes run {\em within\/} their stated time bounds,
without long delays. Let $N$ denote the size of the network. In
the case of the hypercube, our communication scheme runs in $\,2\cdot \log
N+1\,$ time using constant size buffers. Its probability of
successful routing is at least
$\,1-N^{-2.419\cdot\log N + 1.5}\,$, proving Rabin's conjecture.
The same scheme tolerates $\,N/(12\cdot e\cdot\log N)\,$ random
link failures with probability at least
$\,1-2\cdot N\cdot (\log N)^{-\log N/12}\,$ ($e=2.718\ldots$). It
can also tolerate $\,N/c\,$ random link failures with probability
$\,1-O(N^{-1})\,$ for some constant $c$. For a class of $d$-way
shuffle networks, our scheme runs in
$\,\approx 2\cdot \ln N/\ln\ln N\,$ time using constant size
buffers. Its probability of successful routing is at least
$\,1-N^{-\ln N/2}\,$. The same scheme tolerates $\,N/12\,$ random
link failures with probability at least $\,1-N^{-\ln\ln\ln N/12}\,$.
For a class of $d$-way digit-exchange networks, our scheme runs in
$\,\approx 6\cdot \ln N/\ln\ln N\,$ time using constant size
buffers. Its probability of successful routing is at least
$\,1-o(N^{-7\cdot\ln N})\,$. The same scheme tolerates
$\,N/(6\cdot e)\,$ random link failures with probability at least
$\,1-2\cdot N^{-\ln\ln\ln N/2}\,$. Another fault model where links fail
independently with a constant probability is also considered.
Numerical calculations show that with practical failure
probabilities and sizes of the hypercube, our routing scheme for
the hypercube performs well with high probability. On-line and
efficient wire testing and replacement on the hypercube can be
realized if our fault-tolerant routing scheme is used. Let
$\,\alpha\,$ denote the total number of links. It is shown that
$\,\approx\alpha/352\,$ wires can be disabled simultaneously
without disrupting the ongoing computation or degrading the
routing performance much.
%R TR-20-89
%D 1989
%T Lower Bounds on Parallel, Distributed and Automata Computations
%A Mih\'{a}ly Ger\'{e}b-Graus
%X In this thesis we present a collection of lower bound results: A
hierarchy of complexity classes on tree languages (analogous to
the polynomial hierarchy) accepted by alternating finite state
machines is introduced. By separating the deterministic and the
nondeterministic classes of our hierarchy we give a negative
answer to the folklore question of whether the expressive power of
tree automata is the same as that of the finite-state automaton
that can walk on the edges of the tree (the bug automaton). We prove
that three-head one-way DFA cannot perform string-matching, that
is, no three-head one-way DFA accepts the language $L=\{x\#y \mid
x$ {\rm is a substring of} $y$, {\rm where} $x,y \in
\{0,1\}^*\}$. We prove that in a one-round fair coin-flipping (or
voting) scheme with $n$ participants, there is at least one
participant who has a chance to decide the outcome with
probability at least $3/n-o(1/n)$. We prove an optimal lower
bound on the average time required by any algorithm that merges
two sorted lists on the parallel comparison tree model. We
present a proof of a negative answer for a question raised by
Skyum and Valiant, namely, whether the class of symmetric boolean
functions has a p-complete family. We give a combinatorial
characterization for the concept classes learnable from negative
(or positive) examples only in the so-called distribution-free
learning model.
%R TR-21-89
%D 1989
%T A Data Mapping Parallel Language
%A Vinod Kathail
%A Dan C. Stefanescu
%R TR-22-89
%D 1989
%T An Analysis of the Valiant-Brebner Hypercube\\ Routing Algorithm
%A Athanasios Tsantilas
%X We present the best known analysis of the running time of the
Valiant-Brebner algorithm for routing $h$-relations in the $n$-
dimensional hypercube. Refining an elegant proof technique due to
Ranade, we prove that for $h \sim n/\omega(n)$, where $\omega(n)$
tends to infinity arbitrarily slowly, the running time of this
algorithm is $2n+O(n/\ln\omega(n))$ with very high probability.
The same analysis holds for the directed $n$-butterfly as well.
%R TR-23-89
%D 1989
%T Tight Bounds for Oblivious Routing in the Hypercube
%A C. Kaklamanis
%A D. Krizanc
%A A. Tsantilas
%X We prove that given an $N$-node communication network with maximum
indegree $d$, any deterministic oblivious algorithm for routing an
arbitrary permutation requires $\Omega(\sqrt{N}/d)$ time. This
is an improvement of a result by Borodin and Hopcroft. For the
$N$-node hypercube, in particular, we show a matching upper bound
by exhibiting a deterministic oblivious algorithm which routes any
permutation in $O(\sqrt{N}/\log N)$ time. The best previously
known upper bound was $O(\sqrt{N})$.
%R TR-24-89
%D 1989
%T SIMD Algorithms for Image Rendering
%A Robert J. Walsh
%X A parameterized cost model for SIMD machines is developed and used
to rank projective image-rendering algorithms. The ranking
covers such a broad range of remote-communication
costs that it holds in all practical situations. Ranked first is
a new projective algorithm presented in the thesis. Empirical
results support the ranking developed from the analytic models.
The cost model is also applied to ray tracing, so that precomputed
approximate sorts of surface primitives can be rigorously used to
speed parallel ray tracing. The general-purpose architecture
assumption is justified by showing that special-purpose machines
actually render images more slowly. We conjecture that neither
SIMD nor MIMD architectures have a fundamental advantage when
efficient algorithms are known for both machine types. The
algorithms and analysis presented can serve as a model for SIMD
computations in other application domains.
%R TR-01-90
%D 1990
%T Perfect Privacy for Two-Party Protocols
%A Donald Beaver
%X We examine the problem of computing functions in a distributed
manner based on private inputs, revealing the value of a function
without revealing any additional information about the inputs. A
function $f(x_1,\dots,x_n)$ is $t$-private if there is a protocol
whereby $n$ parties, each holding an input value, can compute
$f,$ and no subset of $t$ or fewer parties gains any information
other than what is computable from $f(x_1,\dots,x_n).$ The class
of $t$-private {\em boolean} functions for $t \geq \lceil \frac{n}{2}
\rceil$ was described in [CK89]. We give a characterization
of 1-private functions for two parties, without the restriction to
the boolean case. We also examine a restricted form of private
computation in the $n$-party case, and show that addition is the
only $t$-privately computable function in that model. Incorrect
proofs of this characterization appeared in [Kushilevitz,1989] and
an earlier technical report [Beaver,1989]. We present a different
proof which avoids the errors of those works. This report
supersedes Harvard Technical Report TR-11-89.
%R TR-02-90
%D 1990
%T Cooperative Dialogues While Playing\\ Adventure
%A David Albert
%X To collect examples of the use of hedging phrases in informal
conversation, three pairs of subjects were (separately) asked to
cooperate in playing the computer game {\em Adventure}. While
one typed commands into the computer, the two engaged in
conversation about their options and best strategy. The
conversation was recorded on audio tape, and their computer
session was saved in a disk file. Subsequently, the audio tape
was transcribed and merged with the computer record of the game,
to produce a combined transcript showing what was typed along with
a simultaneous running commentary by the participants. This
report contains the complete transcripts and a discussion of the
collection and transcription methodology.
%R TR-03-90
%D 1990
%T M-LISP:\\ A Representation Independent Dialect of LISP with
Reduction Semantics
%A Robert Muller
%X In this paper we propose to reconstruct LISP from first principles
with an eye toward reconciling S-expression LISP's metalinguistic
facilities with the kind of operational semantics advocated by
Plotkin [Pl81]. After reviewing the original definition of LISP
we define the abstract syntax and the operational semantics of the
new dialect, M-LISP, and show that its equational theory is
consistent. Next we develop the operational semantics of an
extension of M-LISP which features an explicitly callable {\em
eval} and {\em fexprs} (i.e., procedures whose arguments are
passed {\em by-representation}). Since M-LISP is independent of
any representation of its programs, it has no {\em quotation}
operator or any of its related forms. To compensate for this, we
encapsulate the shifting between mention and use, which is
performed globally by {\em quote}, within the metalinguistic
constructs that require it. The resulting equational system is
shown to be inconsistent. We leave it as an open problem to find
confluent variants of the metalinguistic constructs.
%R TR-04-90
%D 1990
%T Syntax Macros in M-LISP:\\ A Representation Independent Dialect of
LISP with Reduction Semantics
%A Robert Muller
%X In this paper we consider syntactic abstraction in M-LISP, a
dialect of LISP which is independent of any representation of its
programs. Since it is independent of McCarthy's original Meta-
expression representation scheme, M-LISP has no {\em quotation}
form or any of its related forms: {\em backquote}, {\em unquote},
and {\em unquote-splicing}. Given that LISP macro systems depend
on this latter representation model, it is not obvious how to base
syntax extensions in M-LISP. Our approach is based on an
adaptation of the {\em Macro-by-Example} [Kowa87] and {\em
Hygienic} [KoFrFeDu86] algorithms. The adaptation for the
quotation-free syntactic structure of M-LISP yields a
substantially different model of syntax macros. The most
important difference is that $\lambda$ binding patterns become
apparent when an abstraction is first (i.e., partially)
transcribed in the syntax tree. This allows us to define tighter
restrictions on the capture of identifiers. This is not possible
in S-expression dialects such as Scheme since $\lambda$ binding
patterns are not apparent until the tree is completely
transcribed.
%R TR-05-90
%D 1990
%T Semantics Prototyping in M-LISP\\ (Extended Abstract)
%A Robert Muller
%X In this paper we describe a new semantic
metalanguage which simplifies prototyping of programming
languages. The system integrates Paulson's semantic grammars
within a new dialect of LISP, M-LISP, which has somewhat closer
connections to the $\lambda$-calculus than other LISP dialects
such as Scheme. The semantic grammars are expressed as attribute
grammars. The generated parsers are M-LISP functions that can
return denotational (i.e., higher-order) representations of
abstract syntax. We illustrate the system with several examples
and compare it to related systems.
%R TR-06-90
%D 1990
%T ESPRIT\\ Executable Specification of Parallel Real-time
Interactive Tasks
%A Azer Bestavros
%X The vital role that embedded systems are playing and will
continue to play in our world, coupled with their increasingly
complex and critical nature, demands a rigorous and systematic
treatment that recognizes their unique requirements. The Time-
constrained Reactive Automaton (TRA) is a formal model of
computation that admits these requirements. Using the TRA model,
an embedded system is viewed as a set of {\em asynchronously}
interacting automata (TRAs), each representing an {\em autonomous}
system entity. TRAs are {\em input enabled}; they communicate by
signaling events on their {\em output channels} and by reacting
to events signaled on their {\em input channels}. The behavior of
a TRA is governed by {\em time-constrained causal relationships}
between {\em computation-triggering} events. The TRA model is
{\em compositional} and allows time, control, and computation
{\em non-determinism}. In this paper we present ESPRIT, a
specification language that is entirely based on the TRA model. We
have developed a compiler that allows ESPRIT specifications to be
executed in simulated time, thus providing a valuable validation
tool for embedded system specifications. We are currently
developing another compiler that would allow the execution of
ESPRIT specifications in real time, thus making it possible to
write real-time programs directly in ESPRIT.
%R TR-07-90
%D 1990
%T A Logic of Concrete Time Intervals\\ (Extended Abstract)
%A Harry R. Lewis
%X This paper describes (1) a finite-state model for asynchronous
systems in which the time delays between the scheduling and
occurrence of the events that cause state changes are constrained
to fall between fixed numerical upper and lower time bounds; (2) a
branching-time temporal logic suitable for describing the temporal
and logical properties of asynchronous systems, for which the
structures of (1) are the natural models; and (3) a functional
verification system for asynchronous circuits, which generates,
from a boolean circuit with general feedback and specified min/max
rise and fall times for the gates, a finite-state structure as in
(1), and then exhaustively checks a formal specification of that
circuit in the language (2) against that finite-state model.
%R TR-08-90
%D 1990
%T Generating Descriptions that Exploit a User's Domain\\ Knowledge
%A Ehud Reiter
%X Natural language generation systems should customize object
descriptions according to the extent of their user's domain and
lexical knowledge. The task of generating customized descriptions
is formalized as a task of finding descriptions that are {\it
accurate} (truthful), {\it valid} (fulfill the speaker's
communicative goal), and {\it free of false implicatures} (do not
give rise to unintended conversational implicatures) with respect
to the current user model. An algorithm that generates
descriptions that meet these constraints is described, and the
computational complexity of the generation problem is discussed.
%R TR-09-90
%D 1990
%T The Computational Complexity of Avoiding Conversational\\
Implicatures
%A Ehud Reiter
%X Referring expressions and other object descriptions should be
maximal under the Local Brevity, No Unnecessary Components, and
Lexical Preference preference rules; otherwise, they may lead
hearers to infer unwanted conversational implicatures. These
preference rules can be incorporated into a polynomial time
generation algorithm, while some alternative formalizations of
conversational implicature make the generation task NP-Hard.
%R TR-10-90
%D 1990
%T Generating Appropriate Natural Language Object Descriptions
%A Ehud Baruch Reiter
%X Natural language generation (NLG) systems must produce different
utterances for users with different amounts of domain and lexical
knowledge. An utterance that is meant to be read by an expert
should use technical vocabulary, and avoid explicitly mentioning
facts the expert can immediately infer from the rest of the
utterance. In contrast, an utterance that is meant to be read by
a novice should avoid specialized vocabulary, and may be required
to explicitly mention facts that would be obvious to an expert.
An NLG system that does not customize utterances according to its
user's domain and lexical knowledge may generate text that is
incomprehensible to a novice, or text that leads an expert to
infer unwanted {\it conversational implicatures} (Grice 1975).
This thesis examines the problem of generating attributive
descriptions of individuals, that is, object descriptions that are
intended to inform the user that a particular object has certain
attributes. It proposes that such descriptions will be appropriate
for a particular user if they are {\it accurate, valid}, and
{\it free of false implicatures} with respect to a user model
that represents that user's relevant domain and lexical
knowledge. Descriptions are represented as definitions of KL-ONE-
type (Brachman and Schmolze 1985) classes, and a description is
called {\it accurate} if it defines a class that subsumes the
object being described; {\it valid} if every attribute the
system wishes to communicate is either part of the description or
a default attribute that is inherited by the class defined by the
description; and {\it free of false implicatures} if it is
maximal under three preference rules: No Unnecessary Components,
Local Brevity, and Lexical Preference.
%R TR-11-90
%D 1990
%T Domain Theory for Nonmonotonic Functions
%A Yuli Zhou
%A Robert Muller
%X We prove several lattice theoretical fixpoint theorems based on
the classical result of Knaster and Tarski. These theorems give
sufficient conditions for a system of generally nonmonotonic
functions on a complete lattice to define a unique fixpoint. The
primary objective of this paper is to develop a domain theoretic
framework to study the semantics of general logic programs as well
as various rule-based systems where the rules define nonmonotonic
functions on lattices.
%R TR-12-90
%D 1990
%T A Syntax and Semantics for Network Diagrams
%A Joe Marks
%X The ability to automatically design graphical displays of data
will be important for the next generation of interactive computer
systems. The research reported here concerns the automated design
of network diagrams, one of the three main classes of symbolic
graphical display (the other two being chart graphs and maps).
Previous notions of syntax and semantics for network diagrams are
not adequate for automating the design of this kind of graphical
display. I present here a new formulation of syntax and semantics
for network diagrams that is used in the ANDD (Automated Network
Diagram Designer) system. The syntactic formulation differs from
previous work in two significant ways: perceptual-organization
phenomena are explicitly represented, and syntax is described in
terms of constraints rather than as a grammar of term-rewriting
rules. The semantic formulation is based on an application-
independent model of network systems that can be used to model
many real-world applications. The paper includes examples that
show how these concepts are used by ANDD to automatically design
network diagrams.
%R TR-13-90
%D 1990
%T Avoiding Unwanted Conversational Implicatures in Text and Graphics
%A Joe Marks
%A Ehud Reiter
%X We have developed two systems, FN and ANDD, that use natural
language and graphical displays, respectively, to communicate
information about objects to human users. Both systems must deal
with the fundamental problem of ensuring that their output does
not carry unwanted and inappropriate conversational implicatures.
We describe the types of conversational implicatures that FN and
ANDD can avoid, and the computational strategies the two systems
use to generate output that is free of unwanted implicatures.
%R TR-14-90
%D 1990
%T Models of Plans to Support Communications:\\ An Initial Report
%A Karen E. Lochbaum
%A Barbara J. Grosz
%A Candace L. Sidner
%X Agents collaborating to achieve a goal bring to their joint
activity different beliefs about ways in which to achieve the goal
and the actions necessary for doing so. Thus, a model of
collaboration must provide a way of representing and
distinguishing among agents' beliefs and of stating the ways in
which the intentions of different agents contribute to achieving
their goal. Furthermore, in collaborative activity, collaboration
occurs in the planning process itself. Thus, rather than
modelling plan recognition, per se, what must be modelled is the
{\it augmentation} of beliefs about the actions of multiple agents
and their intentions. In this paper, we modify and expand the
SharedPlan model of collaborative behavior (Grosz \& Sidner 1990).
We present an algorithm for updating an agent's beliefs about a
partial SharedPlan and describe an initial implementation of this
algorithm in the domain of network management.
%R TR-15-90
%D 1990
%T An Information Dispersal Approach to Issues in Parallel Processing
%A Yuh-Dauh Lyuu
%X Efficient schemes for the following issues in parallel processing
are presented: fast communication, low congestion, fault
tolerance, simulation of ideal parallel computation models,
synchronization in asynchronous networks, low sensitivity to
variations in component speed, and on-line maintenance. All our
schemes employ Rabin's information dispersal idea. We also
develop an efficient information dispersal algorithm (IDA) based
on the Fast Fourier Transform and an IDA-based voting scheme to
enforce fault tolerance. Let $N$ denote the size of the hypercube
network. We present a randomized communication scheme, FSRA (for
``Fault-tolerant Subcube Routing Algorithm''), that routes in $2
\cdot \log N + 1$ time using only constant size buffers and with
probability of success $1 - N^ {-\Theta (\log N)} $. (All
logarithms are to base 2.) FSRA also tolerates $O(N)$ random link
failures with high probability. Similar results are also obtained
for the de Bruijn and the butterfly networks (without fault
tolerance in the latter case). FSRA is employed to simulate,
without using hashing, a class of CRCW PRAM (concurrent-read
concurrent-write parallel random access machine) programs with a
slowdown of $O(\log N)$ with almost certainty if combining is
used. A fault-tolerant simulation scheme for general CRCW PRAM
programs is also presented. A simple acknowledgement synchronizer
can make all our routing schemes in this dissertation run on
asynchronous networks without loss of efficiency. We further show
that speed of any component--be it a processor or a link--has only
linear impact on the run-time of FSRA; that is, the extra delay in
run-time is only proportional to the drift in the component's
delay and is independent of the size of the network. On-line
maintainability makes the machine more available to the user. We
show that, under FSRA, a constant fraction of the links can be
disabled with essentially no impact on the routing performance.
This result immediately suggests several efficient maintenance
procedures. Based on the above results, a fault-tolerant parallel
computing system, called HPC (for ``hypercube parallel
computer''), is sketched at the end of this dissertation.
%R TR-16-90
%D 1990
%T Identifying $\mu$-Formula Decision Trees with Queries
%A Thomas R. Hancock
%X We consider a learning problem for the representation class of
$\mu$-formula decision trees, a generalization of $\mu$-formulas
and $\mu$-decision trees. (The ``$\mu$'' form of a representation
has the restriction that no variable appears more than once). The
learning model is one of exact identification by oracle queries,
where the learner's goal is to discover an unknown function by asking
membership queries (is the function true on some specified input?)
and equivalence queries (is the function identical to some
hypothesis we present, and if not what is an input on which they
differ?). We present an identification algorithm using these two
types of queries that runs in time polynomial in the number of
variables, and we show that no such polynomial time algorithm
exists that uses either membership or equivalence queries alone
(in the latter case under the stipulation that the hypotheses are
drawn from the same representation class). We further extend the
algorithm to identify a broader class in which the formulas are
taken over a more powerful basis that includes arbitrary threshold
gates.
%R TR-17-90
%D 1990
%T Design and Modeling with Schema Grammars
%A Mark Friedell
%A Sandeep Kochhar
%X Graphical scene modeling is usually a time-consuming, tedious, and
expensive manual activity. This paper proposes an approach to
partially automating the process through a paradigm for cooperative
user-computer scene modeling that we call {\em cooperative
computer-aided design} (CCAD). Formal grammars, referred to as
{\em schema grammars} , are used to imbue the modeling system with
an elementary ``understanding'' of the kinds of scenes to be
created. The grammar interpreter constructs part or all of the
scene after accepting from the user partially completed scene
components and descriptions of scene properties that must be
incorporated into the final scene. This approach to modeling
harnesses the power of the computer to construct scene detail,
thereby freeing the human user to focus on essential creative
decisions. This paper describes the structure and interpretation
of schema grammars, and provides techniques for controlling the
combinatorial explosion that can result from the undirected
interpretation of the grammars. CCAD is explored in the context of
two experimental systems---FLATS, for architectural design, and
LG, for modeling landscapes.
%R TR-18-90
%D 1990
%T Cooperative Computer-Aided Design: A Paradigm for Automating the
Design and Modeling of Graphical Objects
%A Sandeep Kochhar
%X Design activity is often characterized by a search in which the
designer examines various alternatives at several stages during
the design process. Current computer-aided design (CAD) systems,
however, provide very little support for this exploratory aspect
of design. My research provides the foundation for {\em
cooperative computer-aided design (CCAD)} ---a novel CAD
technology that intersperses partial exploration of design
alternatives by the computer with guiding design operations by the
system user. CCAD combines the strengths of manual and automated
design and modeling by allowing the user to make creative
decisions and provide specialized detail, while exploiting the
power of the machine to explore many design alternatives and
create detail that is resonant with the user's design decisions.
In the CCAD paradigm, the user expresses initial design decisions
in the form of a partial design and a set of properties that the
final design must satisfy. The user then initiates the generation
by the system of alternative {\em partial} developments of the
initial design subject to a ``language'' of valid designs. The
results are then structured in a spatial framework through which
the user moves to explore the alternatives. The user selects the
most promising partial design, refines it manually, and then
requests further automatic development. This process continues
until a satisfactory complete design is created. I present the
interpretation of schema grammars---a class of generative grammars
for manipulating graphical objects---as the fundamental generative
mechanism underlying CCAD. I describe in detail several mechanisms
for providing user control over the generative process, thereby
controlling the combinatorial explosion inherent in an
unrestricted, undirected interpretation of generative grammars. I
explore graphical browsing as a facility for efficiently perusing
the set of design alternatives generated by the system. I also
describe FLATS ({\bf F}loor plan {\bf LA}you{\bf T} {\bf S}ystem)---a
prototype CCAD system for the design of small architectural floor
plans.
%R TR-19-90
%D 1990
%T Understanding Subsumption and Taxonomy:\\ A Framework for Progress
%A William A. Woods
%X This paper continues a theme begun in my paper, ``What's in a
Link,'' -- seeking a solid foundation for network representations
of knowledge. Like its predecessor, its goal is to clarify issues
and establish a framework for progress. The paper analyzes the
concepts of subsumption and taxonomy and synthesizes a framework
that integrates and clarifies many previous approaches and goes
beyond them to provide an account of abstract and partially
defined concepts. The distinction between definition and
assertion is reinterpreted in a framework that accommodates
probabilistic and default rules as well as universal claims and
abstract and partial definitions. Conceptual taxonomies in this
framework are shown to be useful for indexing and organizing
information and for managing the resolution of conflicting
defaults. The paper introduces a distinction between intensional
and extensional subsumption and argues for the importance of the
former. It presents a classification algorithm based on
intensional subsumption and shows that its typical case complexity
is logarithmic in the size of the knowledge base.
%R TR-20-90
%D 1990
%T The KL-ONE Family
%A William A. Woods
%A James G. Schmolze
%X The knowledge representation system KL-ONE has been one of the
most influential and imitated knowledge representation systems in
the Artificial Intelligence community. Begun at Bolt Beranek and
Newman in 1978, KL-ONE pioneered the development of taxonomic
representations that can automatically classify and assimilate new
concepts based on a criterion of terminological subsumption. This
theme generated considerable interest in both the formal community
and a large community of potential users. The KL-ONE community
has since expanded to include many systems at many institutions
and in many different countries. This paper introduces the KL-ONE
family and discusses some of the main themes explored by KL-ONE
and its successors. We give an overview of current research,
describe some of the systems that have been developed, and outline
some future research directions.
%R TR-21-90
%D 1990
%T Efficiency of Semi-Synchronous versus Asynchronous Networks
%A Hagit Attiya
%A Marios Mavronicolas
%X The $s$-session problem is studied in {\em asynchronous} and
{\em semi-synchronous} networks. Processes are located at the
nodes of an undirected graph $G$ and communicate by sending
messages along links that correspond to the edges of $G$. A
session is a part of an execution in which each process takes at
least one step; an algorithm for the $s$-session problem
guarantees the existence of at least $s$ disjoint sessions. The
existence of many sessions guarantees a degree of interleaving
which is necessary for certain computations. It is assumed that
the (real) time for message delivery is at most $d$. In the
asynchronous model, it is assumed that the time between any two
consecutive steps of any process is in the interval $[0,1]$; in
the semi-synchronous model, the time between any two consecutive
steps of any process is in the interval $[c,1]$ for some $c$ such
that $0 < c \leq 1$, the {\em synchronous} model being the
special case where $c=1$. In the {\em initialized} case of the
problem, all processes are initially synchronized and take a step
at time 0. For the asynchronous model, an upper bound of
$diam(G)(d+1)(s-1)$ and a lower bound of $diam(G)d(s-1)$ are
presented; $diam(G)$ is the {\em diameter} of $G$. For the semi-
synchronous model, an upper bound of $1+\min\{\lfloor \frac {1}
{c} \rfloor +1, diam(G)(d+1)\} (s-2)$ is presented. The main
result of the paper is a lower bound of $1+\min\{\lfloor \frac
{1} {2c} \rfloor, diam(G)d\} (s-2)$ for the time complexity of any
semi-synchronous algorithm for the $s$-session problem, under the
assumption that $d \geq \frac {d} {\min\{\lfloor\frac {1} {2c}
\rfloor, diam(G)d\}} + 2$. These results imply a time separation
between semi-synchronous (in particular, synchronous) and
asynchronous networks. Similar results are proved for the case
where delays are not uniform. In the {\em uninitialized} case of
the problem, all processes but one, the {\em initiator} , start
in a {\em quiescent} state which they may leave upon receiving a
message. Similar results are proved for this case.
%R TR-22-90
%D 1990
%T Efficient execution of homogeneous tasks with unequal run times on
the Connection Machine
%A Azer Bestavros
%A Thomas Cheatham
%X Many scientific applications require the execution of a large
number of identical {\em tasks} , each on a different set of data.
Such applications can easily benefit from the power of SIMD
architectures ( {\em e.g. the Connection Machine} ) by having the
array of processing elements (PEs) execute the task in parallel on
the different data sets. It is often the case, however, that the
task to be performed involves the repetitive application of the
same sequence of steps, {\em a body} , a number of times that
depends on the input or computed data. If the usual {\em task-level
synchronization} is used, the utilization of the array of PEs
degrades substantially. In this paper, we propose a {\em body-
level synchronization} scheme that would boost the utilization of
the array of PEs while keeping the required overhead to a minimum.
We mathematically analyze the proposed technique and show how to
optimize its performance for a given application. Our technique is
especially efficient when the number of tasks to be executed is
much larger than the number of physical PEs available.
%R TR-23-90
%D 1990
%T Modelling Act-Type Relations in Collaborative Activity
%A Cecile T. Balkanski
%X Intelligent agents collaborating to achieve a common goal direct a
significant portion of their effort to planning together. Because
having a plan to perform a given task involves having knowledge
about the ways in which the performance of a set of actions will
lead to the performance of that task, the representation of
actions and of relations among them is a central concern. This
paper provides a formalism for representing act-type relations and
complex act-type constructors in multiagent domains. To determine
a set of relations that would span the space of complex
collaborative actions, I analyzed a videotape of two people
building a piece of furniture, and identified a group of relations
that are adequate to represent the relationships among the actions
occurring in the data. The definitions provided here are more
complex than those used in earlier studies; in particular, I
refine and expand Pollack's (1986) set of act-type relations
(generation and enablement), provide constructors for building
complex act-types from (simpler) act-types (simultaneity,
conjunction, sequence and iteration), and make it possible to
represent the joint actions of multiple agents.
%R TR-24-90
%D 1990
%T Security, Fault Tolerance, and Communication Complexity in
Distributed Systems
%A Donald Rozinak Beaver
%X We present efficient and practical algorithms for a large,
distributed system of processors to achieve reliable computations
in a secure manner. Specifically, we address the problem of
computing a general function of several private inputs distributed
among the processors of a network, while ensuring the correctness
of the results and the privacy of the inputs, despite accidental
or malicious faults in the system. Communication is often the
most significant bottleneck in distributed computing. Our
algorithms maintain a low cost in local processing time, are the
first to achieve optimal levels of fault-tolerance, and most
importantly, have low communication complexity. In contrast to
the best known previous methods, which require large numbers of
rounds even for fairly simple computations, we devise protocols
that use small messages and a constant number of rounds {\em
regardless} of the complexity of the function to be computed.
Through direct algebraic approaches, we separate the {\em
communication complexity of secure computing} from the {\em
computational complexity} of the function to be computed. We
examine security under both the modern approach of computational
complexity-based cryptography and the classical approach of
unconditional, information-theoretic security. We develop a clear
and concise set of definitions that support formal proofs of
claims to security, addressing an important deficiency in the
literature. Our protocols are provably secure. In the realm of
information-theoretic security, we characterize those functions
which two parties can compute jointly with absolute privacy. We
also characterize those functions which a weak processor can
compute using the aid of powerful processors without having to
reveal the instances of the problem it would like to solve. Our
methods include a promising new technique called a {\em locally
random reduction} , which has given rise not only to efficient
solutions for many of the problems considered in this work but to
several powerful new results in complexity theory.
%R TR-25-90
%D 1990
%T Communication Issues in Parallel Computation
%A Athanasios M. Tsantilas
%X This thesis examines the problem of interprocessor communication
in realistic parallel computers. In particular, we consider the
problem of permutation routing and its generalizations in the
mesh, hypercube and butterfly networks. Building on previous
research, we derive lower bounds for a wide class of deterministic
routing algorithms which imply that such algorithms create heavy
traffic congestion. In contrast, we show that randomized routing
algorithms result in both efficient and optimal upper bounds in
the above networks. Experiments were also performed to test the
behaviour of the randomized algorithms. These experiments suggest
interesting theoretical problems. We also examine the problem of
efficient interprocessor communication in a model suggested by
recent advances in optical computing. The main argument of this
thesis is that communication can be made efficient if
randomization is used in the routing algorithms.
%R TR-01-91
%D 1991
%T QCD on the Connection Machine: Beyond $^*\hbox{LISP}$
%A Ralph G. Brickner
%A Clive F. Baillie
%A S. Lennart Johnsson
%X We report on the status of code development for a simulation of
Quantum Chromodynamics (QCD) with dynamical Wilson fermions on the
Connection Machine model CM-2. Our original code, written in \*Lisp,
gave performance in the near-GFLOPS range. We have rewritten the
most time-consuming parts of the code in the low-level programming
system CMIS, including the matrix multiply and the communication.
Current versions of the code run at approximately 3.6 GFLOPS for the
fermion matrix inversion, and we expect the next version to reach or
exceed 5 GFLOPS.
%R TR-02-91
%D 1991
%T Communication and I/O Libraries
%A S. Lennart Johnsson
%A Patrick Worley
%R TR-03-91
%D 1991
%T Optimal Communication Channel Utilization for Matrix Transposition
and Related Permutations on Boolean Cubes
%A S. Lennart Johnsson
%A Ching-Tien Ho
%X Permutations in which each processor sends a unique message to
every other processor are frequently used in multi-processor
systems. We present optimal schedules for such all-to-all
personalized communications with concurrent communication on all
channels in Boolean cube networks. For \ $ K $ \ elements per
processor the number of element transfers in sequence is \ $ \frac
{K} {2} ,$ \ independent of the size of the cube. For a succession
of all-to-all personalized communications on disjoint subcubes of
\ $ \beta $ \ dimensions each, the schedules yield \ $ \frac {K}
{2} + \sigma_ {r} - \beta, \sigma_ {r} + 3 \beta, $ \ or \ $ \frac
{K} {2} + 2 \beta $ \ element exchanges in sequence whichever is
smallest. \ \ $ \sigma_ {r} $ \ is the total number of processor
dimensions in the permutation. The algorithms can be organized
such that indirect addressing is not required for interprocessor
exchanges. We also present a blocking procedure that minimizes
the number of block transfers and the block size for a given
schedule. For schedules with optimal channel utilization the
number of block transfers is \ $ \beta $ \ and the maximum block
size is \ $ \lceil \frac {K} {2 \beta} \rceil $ \ for all-to-all
personalized communication on a \ $ \beta -$cube. The blocking
procedure can also be used to generate optimal schedules for
channel widths greater than the width of the data items. An
implementation on the Connection Machine of one of the algorithms
offers a maximum speed-up of 50\% compared to the previously best
known algorithm.
%R TR-04-91
%D 1991
%T Generalized Shuffle Permutations on Boolean Cubes
%A S. Lennart Johnsson
%A Ching-Tien Ho
%X In a {\em generalized shuffle permutation} an address \ $ (a_ {q-1}
a_ {q-2} \ldots a_ {0} ) $ \ receives its content from an
address obtained through a cyclic shift on a subset of the \ $ q $
\ dimensions used for the encoding of the addresses. Bit-
complementation may be combined with the shift. We give an
algorithm that requires \ $ \frac {K} {2} + 2 $ \ exchanges for \
$ K $ \ elements per processor, when storage dimensions are part
of the permutation, and concurrent communication on all ports of
every processor is possible. The number of element exchanges in
sequence is independent of the number of processor dimensions \ $
\sigma_ {r} $ \ in the permutation. With no storage dimensions in
the permutation our best algorithm requires \ $ (\sigma_ {r} + 1)
\lceil \frac {K} {2 \sigma_ {r}} \rceil $ \ element exchanges. We
also give an algorithm for \ $ \sigma_ {r} = 2, $ \ or when the
real shuffle consists of a number of cycles of length two, that
requires \ $ \frac {K} {2} + 1 $ \ element exchanges in sequence
when there is no bit-complementation. The lower bound is \ $ \frac {K}
{2} $ \ for both real and mixed shuffles with no bit
complementation. The minimum number of communication start-ups is
\ $ \sigma_ {r} $ \ for both cases, which is also the lower bound.
The data transfer time for communication restricted to one port
per processor is \ $ \sigma_ {r} \frac {K} {2} , $ \ and the
minimum number of start-ups is \ $ \sigma_ {r} . $ \ The analysis
is verified by experimental results on the Intel iPSC/1, and for
one case also on the Connection Machine.
%R TR-05-91
%D 1991
%T The Computational Complexity of Cartographic Label\\ Placement
%A Joe Marks
%A Stuart Shieber
%X We examine the computational complexity of cartographic label
placement, a problem derived from the cartographer's task of
placing text labels adjacent to map features in such a way as to
minimize overlaps with other labels and map features.
Cartographic label placement is one of the most time-consuming
tasks in the production of maps. Consequently, several attempts
have been made to automate the label-placement task for some or
all classes of cartographic features (punctual, linear, or areal
features), but all previously published algorithms for the most
basic task---point-feature-label placement---either exhibit worst-
case exponential time complexity, or incorporate incomplete
heuristics that may fail to find an admissible labeling even when
one exists. The computational complexity of label placement is
therefore a matter of practical significance in automated
cartography. We show that admissible label placement is NP-
complete, even for very simple versions of the problem. Thus, no
polynomial time algorithm exists unless $P=NP$. Similarly, we show
that optimal label placement can be solved in polynomial time if
and only if $P=NP$, and this result holds even if we require only
approximately optimal placements. The results are especially
interesting because cartographic label placement is one of the few
combinatorial problems that remains NP-hard even under a geometric
(Euclidean) interpretation. The results are of broader practical
significance, as they also apply to point-feature labeling in non-
cartographic displays, e.g., the labeling of points in a scatter
plot.
%R TR-06-91
%D 1991
%T A Graphical Editor for Three-Dimensional Constraint-Based\\
Geometric Modeling
%A Steven John Sistare
%X The design of geometric models can be a painstaking and time-
consuming task. Typical CAD packages are often primitive or
deficient in the means they offer for placing geometry in a design
or for subsequently modifying the geometry. Modification in
particular can tax a user's patience when it requires many
deletion, creation, or perturbation operations to effect a
conceptually simple change in the design. One area of research
that attempts to address these deficiencies involves the use of
constraints on the geometry as a means of both specifying and
controlling its shape. The form in which the constraint
information is demanded from the user determines the ease of use
of any geometric system that is based on constraints, and is one
of the basic problems to be addressed in the design of such a
system. I present a constraint-based geometric editor that allows
the manipulation of both constraints and geometry using the
direct-manipulation paradigm, which is well established as being of
central importance in many easy-to-use systems. When using the
editor, constraints are presented to the user graphically, in the
context of the geometric design, and may be created, destroyed,
and manipulated interactively along with the geometry. The
constraints may either be created explicitly, or implicitly as a
side effect of creating geometry. In addition, constraints are
used in a novel way to facilitate interactive creation and
positioning of geometry in three space, despite the limitations of
commonly-available two-dimensional display and input devices.
Lastly, whenever geometry is modified using direct manipulation, a
solver is called which updates the geometry in accordance with the
existing constraints. All of these features contribute to the
ease of use of my system. I also present a solver that addresses
another basic problem inherent in constraint-based systems;
namely, the need to efficiently obtain a solution that
instantiates the geometry so as to satisfy the constraints. I
provide a robust and efficient solver that is $O(n^2)$ in the size
of the geometric problem being solved and present the mathematics
that support it. In addition, I present a new algorithm that
partitions the free-form geometry and constraint network into a
number of pieces that may be independently solved. For many
networks, this algorithm yields a solution to the entire network
in close to linear time.
%R TR-07-91
%D 1991
%T Tight Upper and Lower Bounds on the Path Length\\ of Binary Trees
%A Alfredo De Santis
%A Giuseppe Persiano
%X The {\em external path length} of a tree $T$ is the sum of the
lengths of the paths from the root to each external node. The
{\em maximal path length difference,} $\triangle$, is the
difference between the length of the longest and shortest such
path. We prove tight lower and upper bounds on the external path
length of binary trees with $N$ external nodes and prescribed
maximal path length difference $\triangle$. In particular, we give
an upper bound that, for each value of $\triangle$, can be exactly
achieved for infinitely many values of $N$. This improves on the
previously known upper bound that could only be achieved up to a
factor proportional to $N$. We then use the upper bound to give a
simple upper bound on the path length of Red-Black trees which is
asymptotically tight. We also present, as a preliminary result,
an elementary proof of the known upper bound. We finally prove a
lower bound which can be exactly achieved for each value of $N$
and $\triangle \leq N/2$.
%R TR-08-91
%D 1991
%T Abstract Semantics of First-Order Recursive Schemes
%A Robert Muller
%A Yuli Zhou
%X We develop a general framework for deriving abstract domains from
concrete semantic domains in the context of first-order recursive
schemes and prove several theorems which ensure the correctness
(safety) of abstract computations. The abstract domains, which we
call {\em Weak Hoare powerdomains} , subsume the roles of both
the abstract domains and the collecting interpretations in the
abstract interpretation literature.
%R TR-09-91
%D 1991
%T Semantic Domains for Abstract Interpretation
%A Robert Muller
%A Yuli Zhou
%X In this paper we consider abstract interpretation of PCF programs.
The main development is the extension of {\em weak powerdomains}
to higher types. In the classical abstract interpretation
approach, abstract domains are constructed explicitly and the
abstract semantics is then related to the concrete semantics. In
the approach introduced here, abstract domains are {\em derived}
directly from concrete domains. The conditions for deriving the
domains are intended to be as general as possible while still
guaranteeing that the derived domain has sufficient structure so
that it can be used as a basis for computing correct information
about the concrete semantics. We prove three main theorems, the
last of which ensures the correctness of abstract interpretation
of PCF programs given safe interpretations of the constants. This
generalizes earlier results obtained for the special case of
strictness analysis.
%R TR-10-91
%D 1991
%T Performance Modeling of Distributed Memory Architectures
%A S. Lennart Johnsson
%X We provide performance models for several primitive operations on
data structures distributed over memory units interconnected by a
Boolean cube network. In particular, we model single source, and
multiple source concurrent broadcasting or reduction, concurrent
gather and scatter operations, shifts along several axes of multi-
dimensional arrays, and emulation of butterfly networks. We also
show how the processor configuration, data aggregation, and the
encoding of the address space affect the performance for two
important basic computations: the multiplication of arbitrarily
shaped matrices, and the Fast Fourier Transform. We also give an
example of the performance behavior for local matrix operations
for a processor with a single path to local memory, and a set of
registers. The analytic models are verified by measurements on
the Connection Machine model CM-2.
%R TR-11-91
%D 1991
%T Outerjoins---How to Extend a Conventional Optimizer
%A C\'{e}sar Galindo-Legaria
%A Arnon Rosenthal
%X Free choice among join orderings is one of the most powerful
optimizations in a conventional optimizer. But the freedom is
limited to Select/Project/Join queries. In this paper, we extend
this freedom to queries that include outerjoins. Unlike previous
work, these results are not limited to queries possessing a ``nice
structure,'' or queries that are nicely represented in relational
calculus. Our theoretical results concern query
``simplification'' and reassociation using a generalized
outerjoin. We show how the necessary computation can be added
rather easily to the join-order generation of a conventional query
optimizer.
%R TR-12-91
%D 1991
%T An Algorithm for Plan Recognition in Collaborative Discourse
%A Karen E. Lochbaum
%X A model of plan recognition in discourse must be based on intended
recognition, distinguish each agent's beliefs and intentions from
the other's, and avoid assumptions about the correctness or
completeness of the agents' beliefs. In this paper, we present an
algorithm for plan recognition that is based on the SharedPlan
model of collaboration and that satisfies these constraints.
%R TR-13-91
%D 1991
%T Object-Oriented Programming for Massively Parallel Machines
%A Michael F. Kilian
%X Large, robust massively parallel programs that are understandable
(and therefore maintainable) are not yet a reality. Such programs
require a programming methodology that minimizes the conceptual
differences between the program and the domain addressed by the
program, encourages reusability, and still produces robust
programs that are readily maintained and reasoned about. This
paper proposes the parallel object-oriented model. The model is
constructed from an object-oriented methodology augmented by
constructs and semantics for parallel processing, and satisfies
the requirements for building large parallel applications. It
presents a unique way of representing object references and of
managing concurrent access to objects. The methodology may be
extended for a wide range of computing platforms and application
areas.
%R TR-14-91
%D 1991
%T Plan Recognition in Collaborative Discourse
%A Karen E. Lochbaum
%X A model of plan recognition in discourse must be based on intended
recognition, distinguish each agent's beliefs and intentions from
the other's, and avoid assumptions about the correctness or
completeness of the agents' beliefs. In this paper, we present an
algorithm for plan recognition that is based on the SharedPlan
model of collaboration and that satisfies these constraints.
%R TR-15-91
%D 1991
%T Elliptic Curves in Computer Science:\\ Primality Testing,
Factoring, and Cryptography
%A Michael Mitzenmacher
%R TR-16-91
%D 1991
%T Identifiability is Closed under Embeddings in Read-Once\\ Formulas
or $\mu$-Decision Trees
%A Thomas R. Hancock
%X We show a general positive result that allows us to boost the
expressiveness of projection closed classes of boolean functions
that are identifiable with membership and equivalence queries.
Such classes include monotone DNF formulas, read-once formulas,
conjunctions of Horn clauses, switch configurations, $\mu$-formula
decision trees, and read-twice DNF formulas. We show that when
representations from such classes (rather than single literals)
are tested at the leaves of a read-once formula, or on the
internal nodes of a $\mu$-decision tree, the resulting
representation class is still identifiable with membership and
equivalence queries. The additional overhead in time and queries
is polynomial.
%R TR-17-91
%D 1991
%T Computational Complexity of a Problem in Molecular Structure
Prediction
%A J. Thomas Ngo
%A Joe Marks
%X The computational task of protein-structure prediction is believed
to require exponential time, but previous arguments as to its
intractability have taken into account only the size of a
protein's conformational space. Such arguments do not rule out
the possible existence of an algorithm, more selective than
exhaustive search, that is efficient and exact. (An {\em
efficient} algorithm is one that is guaranteed, for all possible
inputs, to run in time bounded by a function polynomial in the
problem size. An {\em intractable} problem is one for which no
efficient algorithm exists.) Questions regarding the possible
intractability of problems are often best answered using the
theory of NP-completeness. In this treatment we show the NP-
hardness of two typical mathematical statements of empirical
potential energy function minimization for macromolecules. Unless
all NP-complete problems can be solved efficiently, these results
imply that a function-minimization algorithm can be efficient for
protein-structure prediction only if it exploits protein-specific
properties that prohibit the simple geometric constructions that
we use in our proofs. Analysis of further mathematical statements
of molecular structure prediction could constitute a systematic
methodology for identifying sources of complexity in protein
folding, and for guiding development of predictive algorithms.
%R TR-18-91
%D 1991
%T Optimal All-to-All Personalized Communication with Minimum Span on
Boolean Cubes
%A S. Lennart Johnsson
%A Ching-Tien Ho
%X All-to-all personalized communication is a class of permutations
in which each processor sends a unique message to every other
processor. We present optimal algorithms for concurrent
communication on all channels in Boolean cube networks, both for
the case with a single permutation, and the case where multiple
permutations shall be performed on the same local data set, but on
different sets of processors. For $K$ elements per processor our
algorithms give the optimal number of element transfers, $K/2$.
For a succession of all-to-all personalized communications on
disjoint subcubes of $\beta$ dimensions each, our best algorithm
yields $\frac{K}{2} + \sigma - \beta$ element exchanges in
sequence, where $\sigma$ is the total number of processor
dimensions in the permutation. An
implementation on the Connection Machine of one of the algorithms
offers a maximum speed-up of 50\% compared to the previously best
known algorithm.
%R TR-19-91
%D 1991
%T Matrix Multiplication on Hypercubes Using Full Bandwidth and
Constant Storage
%A Ching-Tien Ho
%A S. Lennart Johnsson
%A Alan Edelman
%X For matrix multiplication on hypercube multiprocessors with the
product matrix accumulated in place, a processor must receive
about $P^{2}/\sqrt{N}$ elements of each input operand, with
operands of size $P \times P$ distributed evenly over $N$
processors. With concurrent communication on all ports, the
number of element transfers in sequence can be reduced to
$P^{2}/(\sqrt{N} \log N)$ for each input operand. We present a
two-level partitioning of the matrices and an algorithm for the
matrix multiplication with optimal data motion and constant
storage. The algorithm has sequential arithmetic complexity
$2P^{3}$ and parallel arithmetic complexity $2P^{3}/N$. The
algorithm has been implemented on the Connection Machine
model CM-2. For the performance on the 8K CM-2, we measured about
1.6 Gflops, which would scale up to about 13 Gflops for a 64K full
machine.
%R TR-20-91
%D 1991
%T On the Conversion between Binary Code and Binary-Reflected Gray
Code on Boolean Cubes
%A S. Lennart Johnsson
%A Ching-Tien Ho
%X We present a new algorithm for conversion between binary code and
binary-reflected Gray code that requires approximately
$\frac{2K}{3}$ element transfers in sequence for $K$ elements per
node, compared to $K$ element transfers for previously known
algorithms. For a Boolean cube of $n = 2$ dimensions the new
algorithm degenerates to yield a complexity of $\frac{K}{2} + 1$
element transfers, which is optimal. The new algorithm is optimal
within a factor of $\frac{1}{3}$ for any routing strategy. We
show that the minimum number of element transfers for minimum
path length routing is $K$ with concurrent communication on all
channels of every node of a Boolean cube.
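For reference, the code conversion that the parallel algorithm above distributes can be sketched sequentially as follows; this is only the standard binary-reflected Gray code map, not the Boolean-cube communication schedule the report is about:

```python
def binary_to_gray(b: int) -> int:
    # Binary-reflected Gray code: each output bit is the XOR of
    # adjacent input bits, i.e. g = b XOR (b >> 1).
    return b ^ (b >> 1)

def gray_to_binary(g: int) -> int:
    # Decoding folds the XOR back in, from the most significant
    # bit downward.
    b = g
    while g:
        g >>= 1
        b ^= g
    return b
```

Consecutive integers differ in exactly one bit of their Gray codes, which is why the code is used to embed arrays in Boolean cubes.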
%R TR-21-91
%D 1991
%T All-to-All Broadcast and Applications on the Connection\\ Machine
%A Jean-Philippe Brunet
%A S. Lennart Johnsson
%X An all-to-all broadcast routing algorithm that allows concurrent
communication on all channels of the Connection Machine Boolean
cube network is described. Explicit routing formulas are given for
both the physical broadcast between processors, and the virtual
broadcast within processors. Implementation issues are addressed
and timings for the physical and virtual broadcast are given for
the Connection Machine system CM-2. The peak data transfer rate
for the physical broadcast on a 64k CM-2 is 4.1 Gbytes/sec, and
the peak rate for the virtual broadcast is about 20 Gbytes/sec.
Reshaping of arrays is shown experimentally to reduce the
broadcast time by a factor of up to 7 by reducing the amount of
local data motion. Finally, we also show how to exploit symmetry
for computation of an interaction matrix using the all-to-all
broadcast function. Further optimizations are suggested for
$N$-body type calculations. Using the all-to-all broadcast
function, a peak rate of 5 Gflops/s has been achieved for the
$N$-body computations in 32-bit precision on a 64k CM-2.
%R TR-22-91
%D 1991
%T New Approaches to Automating Network-Diagram Layout
%A Corey Kosak
%A Joe Marks
%A Stuart Shieber
%X Network diagrams are a familiar graphic form that can express many
different kinds of information. The problem of automating
network-diagram layout has therefore received much attention. Previous
research on network-diagram layout has focused on the problem of
aesthetically optimal layout, using such criteria as the number of
link crossings, the sum of all link lengths, and total diagram
area. In this paper we propose a restatement of the network-
diagram-layout problem in which layout-aesthetic concerns are
subordinated to perceptual-organization concerns. We describe a
notation for describing the visual organization of a network
diagram. This notation is used in reformulating the layout task
as a constrained-optimization problem in which constraints are
derived from a visual-organization specification and optimality
criteria are derived from layout-aesthetic considerations. Two
new heuristic algorithms are presented for this version of the
layout problem: one algorithm uses a rule-based strategy for
computing a layout; the other is a massively parallel genetic
algorithm. We demonstrate the capabilities of the two algorithms
by testing them on a variety of network-diagram-layout problems.
%R TR-23-91
%D 1991
%T Minimizing the Communication Time for Matrix Multiplication on
Multi-Processors
%A S. Lennart Johnsson
%X We present a few algorithms that allow concurrency in
communication on multiple channels of multi-processors to be
exploited for the multiplication of matrices of arbitrary shapes.
For multi-processors configured as $n$-dimensional Boolean cubes
our algorithms offer a speedup of the communication over previous
algorithms for square matrices and square cubes by a factor of
$\frac{n}{2}$. We show that configuring $N$ processors as a
three-dimensional array may reduce the communication complexity
by a factor of $\sqrt[6]{N}$ compared to the two-dimensional
partitioning. The best two-dimensional configuration of the
multi-processor nodes has a ratio between the number of rows and
columns equal to the ratio between the number of rows and columns
of the product matrix. The optimum three-dimensional
configuration has a ratio between the lengths of the machine axes
equal to the ratio between the lengths of the three axes in
matrix multiplication. For product matrices of extreme shape a
one-dimensional partitioning may be optimum. All
presented algorithms use standard communication functions.
%R TR-24-91
%D 1991
%T Cooley-Tukey FFT on the Connection Machine
%A S. Lennart Johnsson
%A Robert L. Krawitz
%X We describe an implementation of the Cooley-Tukey complex-to-
complex FFT on the Connection Machine. The implementation is
designed to make effective use of the communications bandwidth of
the architecture, its memory bandwidth, and storage with
precomputed twiddle factors. The peak data motion rate that is
achieved for the interprocessor communication stages is in excess
of 7 Gbytes/s for a Connection Machine system CM-200 with 2048
floating-point processors. The peak rate of FFT computations local
to a processor is 12.9 Gflops/s in 32-bit precision, and 10.7
Gflops/s in 64-bit precision. The same FFT routine is used to
perform both one- and multi-dimensional FFT without any explicit
data rearrangement. The peak performance for one-dimensional FFT
on data distributed over all processors is 5.4 Gflops/s in 32-bit
precision and 3.2 Gflops/s in 64-bit precision. The peak
performance for square two-dimensional transforms is 3.1 Gflops/s
in 32-bit precision, and for cubic three-dimensional transforms
the peak is 2.0 Gflops/s in 64-bit precision. Certain
oblong shapes yield better performance. The number of twiddle
factors stored in each processor is $\frac{P}{2N} + \log_{2} N$
for an FFT on $P$ complex points uniformly distributed among $N$
processors. To achieve this level of storage
efficiency we show that a decimation-in-time FFT is required for
normal order input, and a decimation-in-frequency FFT is required
for bit-reversed input order.
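As a point of reference for the abstract above, a minimal sequential radix-2 decimation-in-time Cooley-Tukey FFT can be sketched as follows; this illustrative version computes twiddle factors inline and has none of the precomputation or data-distribution machinery of the CM implementation:

```python
import cmath

def fft(x):
    # Recursive radix-2 decimation-in-time Cooley-Tukey FFT.
    # Assumes len(x) is a power of two.
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out
```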
%R TR-25-91
%D 1991
%T Communication Efficient Multi-processor FFT
%A S. Lennart Johnsson
%A Michel Jacquemin
%A Robert L. Krawitz
%X Computing the Fast Fourier Transform on a distributed-memory
architecture by a direct pipelined radix-2 algorithm, a
bi-section algorithm, or a multi-section algorithm yields the
same communications requirement, provided that communication for
all FFT stages can be performed concurrently, the input data is
in normal order, and the data allocation is consecutive. With a
cyclic data allocation, or bit-reversed input data and a
consecutive allocation, multi-sectioning offers a communications
requirement reduced by approximately a factor of two. For a
consecutive data allocation, normal input
order, a decimation-in-time FFT requires that $\frac{P}{N} + d -
2$ twiddle factors be stored for $P$ elements distributed evenly
over $N$ processors and the axis subject to transformation
distributed over $2^{d}$ processors. No communication of twiddle
factors is required. The same storage requirements hold for a
decimation-in-frequency FFT, bit-reversed input order, and
consecutive data allocation. The opposite combination of FFT type
and data ordering requires a factor of $\log_{2} N$ more storage
for $N$ processors. The peak
performance for a Connection Machine system CM-200 implementation
is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit
precision for unordered transforms local to each processor. The
corresponding execution rates for ordered transforms are 11.1
Gflops/s and 8.5 Gflops/s, respectively. For distributed one- and
two-dimensional transforms the peak performance for unordered
transforms exceeds 5 Gflops/s in 32-bit precision, and 3 Gflops/s
in 64-bit precision. Three-dimensional transforms execute at a
slightly lower rate. Distributed ordered transforms execute at a
rate of about $\frac{1}{2}$ to $\frac{2}{3}$ of that of the
unordered transforms.
%R TR-26-91
%D 1991
%T Learning Nonoverlapping Perceptron Networks From Examples and
Membership Queries
%A Thomas R. Hancock
%A Mostefa Golea
%A Mario Marchand
%X We investigate, within the PAC learning model, the problem of
learning nonoverlapping perceptron networks. These are loop-free
neural nets in which each node has only one outgoing weight. We
give a polynomial time algorithm that PAC learns any
nonoverlapping perceptron network using examples and membership
queries. The algorithm is able to identify both the architecture
and the weight values necessary to represent the function to be
learned. Our results shed some light on the effect of the overlap
on the complexity of learning in neural networks.
%R TR-27-91
%D 1991
%T Labeling Point Features on Maps and Diagrams Using Simulated
Annealing
%A Jon Christensen
%A Joe Marks
%A Stuart Shieber
%X A major factor affecting the clarity of graphical displays is the
degree to which text labels obscure display features (including
other labels) as a result of spatial overlap. Point-feature label
placement (PFLP) is the problem of placing text labels adjacent to
point features on a map or diagram in order to maximize
legibility. This problem arises for all kinds of informational
graphics, though it is most often associated with automated
cartography. In this paper we present a comprehensive treatment
of the PFLP problem. First, we summarize some recent results
regarding the computational complexity of PFLP. These results show
that optimal PFLP is NP-hard. Second, we survey previously
reported algorithms for PFLP. Third, we describe a stochastic-
optimization method for PFLP, based on simulated annealing.
Finally, we present the results of an empirical comparison of the
known algorithms for PFLP. Our results indicate that the
simulated-annealing approach to PFLP is superior to all existing
methods, regardless of label density.
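The general approach can be sketched in miniature; the following toy annealer (not the paper's implementation) uses four candidate positions per point, an objective that merely counts overlapping label pairs, and a simple linear cooling schedule, all of which are illustrative assumptions:

```python
import itertools
import math
import random

def anneal_labels(points, w=2.0, h=1.0, steps=2000, t0=1.0, seed=0):
    # Toy point-feature label placement by simulated annealing:
    # each point's label is a w-by-h rectangle placed with one of
    # its four corners at the point; cost = overlapping label pairs.
    rng = random.Random(seed)
    offsets = [(0.0, 0.0), (-w, 0.0), (0.0, -h), (-w, -h)]

    def rect(p, c):
        (x, y), (dx, dy) = p, offsets[c]
        return (x + dx, y + dy, x + dx + w, y + dy + h)

    def overlap(a, b):
        return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

    def cost(choice):
        rects = [rect(p, c) for p, c in zip(points, choice)]
        return sum(overlap(r, s)
                   for r, s in itertools.combinations(rects, 2))

    choice = [rng.randrange(4) for _ in points]
    cur = cost(choice)
    best, best_cost = list(choice), cur
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-9  # linear cooling
        i = rng.randrange(len(points))
        old = choice[i]
        choice[i] = rng.randrange(4)
        new = cost(choice)
        # Accept improvements always, and uphill moves with
        # probability exp(-delta/t) to escape local minima.
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best_cost:
                best, best_cost = list(choice), cur
        else:
            choice[i] = old
    return best, best_cost
```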
%R TR-28-91
%D 1991
%T Linearizable Read/Write Objects
%A Marios Mavronicolas
%A Dan Roth
%X We study the cost of implementing {\em linearizable} read/write
objects for shared-memory multiprocessors under various
assumptions about the available timing information. We take as cost
measure the {\em worst-case response time} of performing an
operation in distributed implementations of virtual shared memory
consisting of such objects and supporting linearizability. It is
assumed that processes have clocks that run at the same rate as
real time and all messages incur a delay in the range $[d-u,d]$
for some known constants $u$ and $d$, $0 \leq u \leq d$. In the
{\em perfect clocks} model, where processes have perfectly
synchronized clocks and every message incurs a delay of exactly
$d$, we present a family of optimal linearizable implementations,
parameterized by a constant $\beta$, $0 \leq \beta \leq 1$, for
which the worst-case response times for read and write operations
are $\beta d$ and $(1-\beta)d$, respectively. The parameter
$\beta$ may be appropriately chosen to account for the relative
frequencies of read and write operations. Our main result is the
first known linearizable implementation for the {\em imperfect
clocks} model where clocks are not initially synchronized and
message delays can vary, i.e., $u > 0$; it achieves worst-case
response times of less than $4u+b$ ($b>0$ is an arbitrarily small
constant) and $d+3u$ for read and write operations, respectively.
This implementation uses novel synchronization techniques to
exploit the lower bound on message delay time and achieve bounds
on worst-case response times that depend on the message delay
uncertainty $u$. For a wide range of values of $u$, these bounds
improve previously known ones for implementations that support
consistency conditions even weaker than linearizability.
%R TR-01-92
%D 1992
%T Multiplication of Matrices of Arbitrary Shape on a Data\\ Parallel
Computer
%A Kapil K. Mathur
%A S. Lennart Johnsson
%X Some level-2 and level-3 Distributed Basic Linear Algebra
Subroutines (DBLAS) that have been implemented on the Connection
Machine system CM-200 are described. For matrix-matrix
multiplication, both the nonsystolic and the systolic algorithms
are outlined. A systolic algorithm that computes the product
matrix in-place is described in detail. All algorithms that are
presented here are part of the Connection Machine Scientific
Software Library, CMSSL. We show that a level-3 DBLAS yields
better performance than a level-2 DBLAS. On the Connection
Machine system CM-200, blocking yields a performance improvement
by a factor of up to three over level-2 DBLAS. For certain matrix
shapes the systolic algorithms offer both improved performance and
significantly reduced temporary storage requirements compared to
the nonsystolic block algorithms. The performance improvement
over the blocked nonsystolic algorithms may be as much as a factor
of seven, or more than a factor of 20 over the level-2 DBLAS. We
show that, in order to minimize the communication time, an
algorithm that leaves the largest operand matrix stationary should
be chosen for matrix-matrix multiplication. Furthermore, it is
shown both analytically and experimentally that the optimum shape
of the processor array yields square stationary submatrices in
each processor, i.e., the ratio between the length of the axes of
the processing array must be the same as the ratio between the
corresponding axes of the stationary matrix. The optimum
processor array shape may yield a factor of five performance
enhancement for the multiplication of square matrices. For
rectangular matrices a factor of 30 improvement was observed for
an optimum processor array shape compared to a poorly chosen
processor array shape.
%R TR-02-92
%D 1992
%T A Data Parallel Finite Element Method for Computational Fluid
Dynamics on the Connection Machine System
%A Zden\v{e}k Johan
%A Thomas J.R. Hughes
%A Kapil K. Mathur
%A \\
%A S. Lennart Johnsson
%X A finite element method for computational fluid dynamics has been
implemented on the Connection Machine systems CM-2 and CM-200. An
implicit iterative solution strategy, based on the preconditioned
matrix-free GMRES algorithm, is employed. Parallel data
structures built on both nodal and elemental sets are used to
achieve maximum parallelization. Communication primitives
provided through the Connection Machine Scientific Software
Library substantially improved the overall performance of the
program. Computations of three-dimensional compressible flows
using unstructured meshes having close to one million elements,
such as a complete airplane, demonstrate that the Connection
Machine systems are suitable for these applications. Performance
comparisons are also carried out with the vector computers Cray Y-
MP and Convex C-1.
%R TR-03-92
%D 1992
%T Efficiency of Semi-Synchronous versus Asynchronous Systems:
Atomic Shared Memory
%A Marios Mavronicolas
%X The {\em $s$-session problem} is studied in {\em asynchronous}
and {\em semi-synchronous} shared-memory systems, under a
particular shared-memory communication primitive -- $k$-writer,
$k$-reader atomic registers -- where $k$ is a constant reflecting
the communication bound in the model. A session is a part of an
execution in which each of $n$ processes takes at least one step;
an algorithm for the $s$-session problem guarantees the existence
of at least $s$ disjoint sessions. The existence of many sessions
guarantees a degree of interleaving which is necessary for certain
computations. In the asynchronous model, it is assumed that the
time between any two consecutive steps of any process is in the
interval $[0,1]$; in the semi-synchronous model, the time between
any two consecutive steps of any process is in the interval
$[c,1]$ for some $c$ such that $0 < c \leq 1$, the {\em
synchronous} model being the special case where $c=1$. All
processes are initially synchronized and take a step at time $0$.
Our main result is a tight (within constant factors) lower bound
of $1 + \min\{\lfloor \frac{1}{2c} \rfloor, \lfloor
\log_{k}(n-1) - 1 \rfloor\}(s-2)$ for the time complexity of any
semi-synchronous algorithm for the $s$-session problem. This result
implies a time separation between semi-synchronous and
asynchronous shared memory systems.
%R TR-04-92
%D 1992
%T Block-Cyclic Dense Linear Algebra
%A Woody Lichtenstein
%A S. Lennart Johnsson
%X Block-cyclic order elimination algorithms for LU and QR
factorization and solve routines are described for distributed
memory architectures with processing nodes configured as two-
dimensional arrays of arbitrary shape. The cyclic order
elimination together with a consecutive data allocation yields
good load-balance for both the factorization and solution phases
for the solution of dense systems of equations by LU and QR
decomposition. Blocking may offer a substantial performance
enhancement on architectures for which the level-2 or level-3 BLAS
are ideal for operations local to a processing node. High rank
updates local to a node may have a performance that is a factor of
four or more higher than a rank-1 update. We show that in our
implementation the $O(N^{2})$ work in the factorization is of the
same significance as the $O(N^{3})$ work, even for large
matrices, because the $O(N^{2})$ work is poorly load-balanced in
the two-dimensional processor array configuration.
However, we show that the two-dimensional processor array
configuration with consecutive data allocation and block-cyclic
order elimination is optimal with respect to communication for a
simple, but fairly general communications model. In our Connection
Machine system CM-200 implementation, the peak performance for LU
factorization is about 9.4 Gflops/s in 64-bit precision and 16
Gflops/s in 32-bit precision. Blocking offers an overall
performance enhancement of about a factor of two. For the data
motion, use is made of the fact that the nodes along each axis of
the two-dimensional array are interconnected as Boolean cubes.
%R TR-05-92
%D 1992
%T Efficient, Strongly Consistent Implementations\\ of Shared Memory
%A Marios Mavronicolas
%A Dan Roth
%X We present two distributed organizations of multiprocessor shared
memory and develop for them implementations that satisfy a strong
consistency condition, namely {\em linearizability}, achieve
improvements in efficiency over previous implementations that
support even weaker consistency conditions, and possess other
important, sought-after properties that make them practically
attractive. It is assumed throughout this paper that
processes have clocks that run at the same rate as real time and
all messages incur a delay in the range $[d-u,d]$ for some known
constants $u$ and $d$, $0 \leq u \leq d$. The efficiency of an
implementation is measured by the {\em worst-case response time}
for performing an operation on an object. For the {\em full
caching} organization, where each process keeps local copies of
all objects, we present the first efficient linearizable
implementation of read/write objects. The family of linearizable
implementations we present is parameterized in a way that allows
the less frequently employed operation to be degraded, and is
shown to be essentially optimal. For the {\em single ownership}
organization, each shared object is ``owned'' by a single process,
which is most likely to access it frequently. We present an
implementation that allows a process to access local information
much faster (almost instantaneously) than it can access remote
information, while still supporting linearizability. While the
cost of the global operations depends on the maximal message delay
$d$, the cost of the local operations depends only on the message
delay uncertainty $u$. In both implementations, decisions made by
individual processes do not make use of any communicated timing
information. In particular, timing information is not part of the
messages passed by our protocols, and those are of bounded size.
These two organizations can be combined in a hierarchical memory
structure, which supports linearizability very efficiently; this
hybrid structure allows processes to access local and remote
information in a transparent manner, while at a lower level of the
memory consistency system, different portions of the memory,
allocated a priori according to anticipated remote versus local
use of the objects, employ the suitable full caching or single
ownership implementation.
%R TR-06-92
%D 1992
%T Lecture Notes on Domain Theory
%A Robert Muller
%X This report contains a collection of lecture notes for a series of
lectures introducing Scott's {\em domain theory}. The basic
structure of semantic domains and fixpoint theory are introduced.
The lecture notes are not intended to serve as a primary
reference but rather as a supplement to a more comprehensive
treatment.
%R TR-07-92
%D 1992
%T Index Transformation Algorithms in a Linear\\ Algebra Framework
%A Alan Edelman
%A Steve Heller
%A S. Lennart Johnsson
%X We present a linear algebraic formulation for a class of index
transformations such as Gray code encoding and decoding, matrix
transpose, bit reversal, vector reversal, shuffles, and other
index or dimension permutations. This formulation unifies,
simplifies, and can be used to derive algorithms for hypercube
multiprocessors. We show how all the widely known properties of
Gray codes, and some not so well-known properties as well, can be
derived using this framework. We also use it to relate hypercube
communications algorithms to Gauss-Jordan elimination on a matrix
of 0's and 1's.
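As an illustration of the linear algebraic viewpoint (under the assumption of an LSB-first bit indexing, which may differ from the report's own convention), the Gray code encoder is the GF(2) linear map given by a bidiagonal 0-1 matrix:

```python
def gray_matrix(d):
    # Encoding matrix G for d-bit binary-reflected Gray code over
    # GF(2), bits indexed LSB-first: g_i = b_i XOR b_{i+1}, and
    # g_{d-1} = b_{d-1}.
    return [[1 if j in (i, i + 1) else 0 for j in range(d)]
            for i in range(d)]

def gf2_matvec(m, v):
    # Matrix-vector product over GF(2): mod-2 sum (XOR) of the
    # entries selected by the 1-bits of v.
    return [sum(m[i][j] & v[j] for j in range(len(v))) % 2
            for i in range(len(m))]

def to_bits(x, d):
    # Integer to LSB-first bit vector of length d.
    return [(x >> i) & 1 for i in range(d)]
```

Applying `gray_matrix(d)` to the bit vector of `b` reproduces `b ^ (b >> 1)`, and inverting the matrix over GF(2) by Gauss-Jordan elimination yields the decoder.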
%R TR-08-92
%D 1992
%T An Alternative Conception of Tree-Adjoining Derivation
%A Yves Schabes
%A Stuart M. Shieber
%X The precise formulation of derivation for tree-adjoining grammars
has important ramifications for a wide variety of uses of the
formalism, from syntactic analysis to semantic interpretation and
statistical language modeling. We argue that the definition of
tree-adjoining derivation must be reformulated in order to
manifest the proper linguistic dependencies in derivations. The
particular proposal is both precisely characterizable, through a
compilation to linear indexed grammars, and computationally
operational, by virtue of an efficient algorithm for recognition
and parsing.
%R TR-09-92
%D 1992
%T Local Basic Linear Algebra Subroutines (LBLAS) for\\ Distributed
Memory Architectures and Languages with\\ Array Syntax
%A S. Lennart Johnsson
%A Luis F. Ortiz
%X We describe a subset of the level-1, level-2, and level-3 BLAS
implemented for each node of the Connection Machine system CM-200
and with a set of interfaces consistent with Fortran 90. The
implementation performs computations on multiple instances in a
single call to a routine. The strides for the different axes are
derived from an array descriptor that contains information about
the length of the axes, the number of instances and their
allocation in the machine. Another novel feature of our
implementation of the BLAS in each node is a selection of loop
order for rank-1 updates and matrix-matrix multiplication based
upon array shapes, strides, and DRAM page faults. The peak
efficiencies for the routines are in the range 75\% to 90\%. The
optimization of loop ordering has a success rate exceeding 99.8\%
for matrices for which the sum of the length of the axes is at
most 60. The success rate is even higher for all possible matrix
shapes. The performance loss when a nonoptimal choice is made is
less than $\sim$15\% of peak, and typically less than 1\% of peak. We
also show that the performance gain for high rank updates may be
as much as a factor of 6 over rank-1 updates.
%R TR-10-92
%D 1992
%T Direct Bulk-Synchronous Parallel Algorithms
%A Alexandros V. Gerbessiotis
%A Leslie G. Valiant
%X We describe a methodology for constructing parallel algorithms
that are transportable among parallel computers having different
numbers of processors, different bandwidths of interprocessor
communication and different periodicity of global synchronization.
We do this for the bulk-synchronous parallel (BSP) model, which
abstracts the characteristics of a parallel machine into three
numerical parameters $p$, $g$, and $L$, corresponding to
processors, bandwidth, and periodicity respectively. The model
differentiates memory that is local to a processor from that which
is not, but, for the sake of universality, does not differentiate
network proximity. The advantages of this model in supporting
shared memory or PRAM style programming have been treated
elsewhere. Here we emphasize the viability of an alternative
direct style of programming where, for the sake of efficiency,
the programmer retains control of memory allocation. We show that
optimality to within a multiplicative factor close to one can be
achieved for the problems of Gauss-Jordan elimination and sorting,
by transportable algorithms that can be applied for a wide range
of values of the parameters $p$, $g$, and $L$. We also give some
simulation results for PRAMs on the BSP to identify the level of
slack at which corresponding efficiencies can be approached by
shared memory simulations, provided the bandwidth parameter $g$ is
good enough.
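The cost accounting behind the parameters $p$, $g$, and $L$ can be written down directly; in one common form of the model a superstep with local work $w$ and an $h$-relation costs $w + g \cdot h + L$ (the per-superstep $(w, h)$ values in the usage below are hypothetical):

```python
def bsp_cost(supersteps, g, L):
    # BSP cost model: a superstep with maximum local work w and an
    # h-relation (max words sent or received by any processor)
    # costs w + g*h + L; the program's cost sums over supersteps.
    return sum(w + g * h + L for (w, h) in supersteps)
```

For example, `bsp_cost([(10, 2), (5, 0)], g=4, L=3)` charges $10 + 4 \cdot 2 + 3$ for the first superstep and $5 + 0 + 3$ for the second.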
%R TR-11-92
%D 1992
%T Deriving Parallel and Systolic Programs from Data Dependence
%A Lilei Chen
%X We present an algorithm that statically sequences data
computations and communications for parallel and systolic
executions. Instead of searching for implicit parallelism in a
functional or sequential program, the algorithm looks for sequence
requirements imposed by the data dependence and by the
communication delays. It achieves its efficiency by analyzing the
sequence constraints. The actual sequence of the parallel
computations can be decided at the last stage, tailored to the
specific parallel machine. In addition, once the processor
mapping is decided, the data communication delay is combined with
the computation sequence to obtain the final scheduling of the
computations and communications. As a result, the parallel
implementation fully exploits the parallelism in the original
program and effectively schedules the computations and minimizes
the communication cost by systolic design. Because the algorithm
is based only on the data dependence in the original program, it
can be applied to a wide variety of program forms, from sequential
loop programs with updates, to recursive equation sets. It can
detect parallelism in sequential programs, as well as provide
efficient implementations for recurrence statements in equation
sets.
%R TR-12-92
%D 1992
%T Algebraic Optimization of Outerjoin Queries
%A C\'{e}sar Alejandro Galindo-Legaria
%X The purpose of this thesis is to extend database optimization
techniques for joins to queries that contain both joins and
outerjoins. The benefits of query optimization are thus extended
to a number of important applications, such as federated
databases, nested queries, and hierarchical views, for which
outerjoin is a key component. Our analysis of join/outerjoin
queries is done in two parts. First, we investigate the
interaction of outerjoin with other relational operators, to find
simplification rules and associativity identities. Our approach
is comprehensive and includes, as special cases, some outerjoin
optimization heuristics that have appeared in the literature.
Second, we abstract the notion of feasible evaluation order for
binary, join-like operators, considering associativity rules but
not specific operator semantics. Combining these two parts, we
show that a join/outerjoin query can be evaluated by combining
relations in any given order --- just as is done for join
queries, except that now we need to synthesize an operator to use
at each step, rather than always using join. The purpose of
changing the order of processing of relations is to reduce the
size of intermediate results. Our theoretical results are
converted into algorithms compatible with the architecture of
conventional database optimizers. For optimizers that repeatedly
transform feasible strategies, the outerjoin identities we have
identified can be applied directly. Those identities are
sufficient to obtain all possible orders of processing. For
optimizers that generate join programs bottom-up, we give a rule
to determine the operators to use at each step.
%R TR-13-92
%D 1992
%T Timing-Based, Distributed Computation:\\ Algorithms and
Impossibility Results
%A Marios Mavronicolas
%X Real distributed systems are subject to timing uncertainties:
processes may lack a common notion of real time, or may even have
only inexact information about the amount of real time needed for
performing primitive computation steps. In this thesis, we embark
on a study of the complexity theory of such systems and present
combinatorial results that determine the inherent costs of some
accomplishable tasks. We first consider {\em continuous-time}
models, where processes obtain timing information from
continuous-time clocks that run at the same rate as real time, but might not
be initially synchronized. Due to an uncertainty in message delay
time, absolute process synchronization is known to be impossible
for such systems. We develop novel synchronization schemes for
such systems and use them for building a distributed, {\em full
caching} implementation of shared memory that supports {\em
linearizability} . This implementation improves in efficiency over
previous ones that support consistency conditions even weaker than
linearizability and supports a quantitative degradation of the
less frequently occurring operation. We present lower bound
results which show that our implementation achieves efficiency
close to optimal. We next turn to {\em discrete-time} models,
where the time between any two consecutive steps of a process is
in the interval $[c,1]$, for some constant $c$ such that $0 \leq c
\leq 1$. We show time separation results between {\em
asynchronous} and {\em semi-synchronous} such models, defined by
taking $c=0$ and $c > 0$, respectively. Specifically, we use the
{\em session problem} to show that the semi-synchronous model, for
which the timing uncertainty, $\frac {1} {c} $, is bounded, is
strictly more powerful than the asynchronous one under either
message-passing or shared-memory interprocess communication. We
also present tight lower and upper bounds on the degree of {\em
precision} that can be achieved in the semi-synchronous model.
Our combinatorial results shed some light on the capabilities and
limitations of distributed systems subject to timing
uncertainties. In particular, the main argument of this thesis is
that the goal of designing distributed algorithms so that their
logical correctness is timing-independent, whereas their
performance might depend on timing assumptions, will not always be
achievable: for some tasks, the only practical solutions might be
strongly timing-dependent.
%R TR-14-92
%D 1992
%T An Upper and a Lower Bound for Tick Synchronization
%A Marios Mavronicolas
%X The {\em tick synchronization problem} is defined and studied in
the {\em semi-synchronous} network, where $n$ processes are
located at the nodes of a complete graph and communicate by
sending messages along links that correspond to its edges. An
algorithm for the tick synchronization problem brings each process
into a synchronized state in which the process makes an estimate
of real time that is close enough to those of other processes
already in a synchronized state. It is assumed that the (real)
time for message delivery is at most $d$ and the time between any
two consecutive steps of any process is in the interval $[c,1]$
for some $c$ such that $0 < c \leq 1$. We define the {\em
precision} of a tick synchronization algorithm to be the maximum
difference between estimates of real time made by different
processes in a synchronized state, and propose it as a worst-case
performance measure. We show that no such algorithm can guarantee
precision less than $\lfloor \frac {d-2} {2c} \rfloor$. We also
present an algorithm which achieves a precision of $\frac {2(n-1)}
{n} (\lceil \frac {2d} {c} \rceil + \frac {d} {2} ) + \frac {1-c}
{c} d+1$.
%R TR-15-92
%D 1992
%T The Complexity of Learning Formulas and Decision Trees that have
Restricted Reads
%A Thomas Raysor Hancock
%X Many learning problems can be phrased in terms of finding a close
approximation to some unknown target formula $f$, based on
observing $f$'s value on a sample of points either drawn at random
according to some underlying distribution, or perhaps selected by
a learner for algorithmic reasons. In this research our goal is
to prove theorems about what classes of formulas permit such
learning in polynomial time (using the definitions of either
Valiant's PAC model or Angluin's exact identification model). In
particular we take powerful classes of formulas whose learnability
is unknown or provably intractable, and then consider restricted
cases where the number of different times a single variable may
appear in the formula is limited to a small constant. We prove
positive learnability results in several such cases, given either
added assumptions on the underlying distribution of random points
or the ability of the learner to select some of the sample points.
We provide polynomial time learning algorithms for decision trees
and monotone disjunctive normal form (DNF) formulas when variables
appear at most some arbitrary constant number of times, given that
the sample points are chosen uniformly. Over arbitrary
distributions, we show algorithms that choose their own sample
points, besides using random examples, to closely approximate the
same class of decision trees and the class of DNF formulas where
variables appear at most twice. For arbitrary formulas, we give a
number of algorithms for the read-once case (where variables
appear only once) over different bases (the functions computed at
the formula's nodes). Besides identification algorithms for large
classes of boolean read-once formulas, these results include new
interpolation algorithms for classes of rational functions, and a
membership query algorithm for a new class of neural networks.
%R TR-16-92
%D 1992
%T Optimal Communication Channel Utilization for \\ Matrix
Transposition and Related Permutations\\ on Binary Cubes
%A S. Lennart Johnsson
%A Ching-Tien Ho
%X We present optimal schedules for permutations in which each node
sends one or several unique messages to every other node. With
concurrent communication on all channels of every node in binary
cube networks, the number of element transfers in sequence for \ $
K $ \ elements per node is \ $ {K \over 2} , $ \ irrespective of
the number of nodes over which the data set is distributed. For a
succession of \ $ s $ \ permutations within disjoint subcubes of \
$ d $ \ dimensions each, our schedules yield \ $ \min ( {K \over
2} + (s - 1)d,(s + 3)d, {K \over 2} + 2 d) $ \ exchanges in
sequence. The algorithms can be organized to avoid indirect
addressing in the internode data exchanges, a property that
increases the performance on some architectures. For message
passing communication libraries, we present a blocking procedure
that minimizes the number of block transfers while preserving the
utilization of the communication channels. For schedules with
optimal channel utilization, the number of block transfers for a
binary \ $ d $-cube is \ $ d. $ The maximum block size for \ $ K $
\ elements per node is \ $ \lceil {K \over 2d} \rceil. $
%R TR-17-92
%D 1992
%T Parallel Sets: An Object-Oriented Methodology for Massively
Parallel Programming
%A Michael Francis Kilian
%X Parallel programming has become the focus of much research in the
past decade. As the limits of VLSI technology are tested, it
becomes more apparent that parallel processors will be responsible
for the next quantum leap in performance. Already parallel
programming is responsible for significant advances not so much in
the speed of solving problems, but in the size of problems that
can be solved. Carefully crafted parallel programs are solving
problems magnitudes larger than could be considered for serial
machines. Object-oriented programming has also become popular in
academia and perhaps even more so in industry. O-O holds out the
promise of being able to efficiently build large systems that are
understandable, maintainable, and more robust. The programs
targeted by O-O are different from those typically found running
on a computer such as the Connection Machine. Parallel programs
are often designed for very specific tasks; O-O programs'
strengths are that they handle a wide variety of requirements. The
thesis proposed here is that an object-oriented model of
programming can be developed that is suitable for massively
parallel processors. A set of criteria is developed for
object-oriented parallel programming models, and existing models are
evaluated against these criteria. Given these criteria, the thesis
presents a new way of thinking of parallel programs that builds
upon an object-oriented foundation. A new basic type, called
Parallel-Set, is added to the object model. Parallel sets are
rigorously defined and then used to express complex communication
between objects. The communication model is then extended to
allow communication and synchronization protocols to be developed.
The contribution of this work is that a wider range of reliable
programs can be designed for use on parallel computers and that
these programs will be easier to construct and understand.
%R TR-18-92
%D 1992
%T Language and Compiler Issues in Scalable High\\ Performance
Scientific Libraries
%A S. Lennart Johnsson
%X Library functions for scalable architectures must be designed to
correctly and efficiently support any distributed data structure
that can be created with the supported languages and associated
compiler directives. Libraries must be designed also to support
concurrency in each function evaluation, as well as the concurrent
application of the functions to disjoint array segments, known as
{\em multiple-instance} computation. Control over the data
distribution is often critical for locality of reference, and so
is the control over the interprocessor data motion. Scalability,
while preserving efficiency, implies that the data distribution,
the data motion, and the scheduling are adapted to the object
shapes, the machine configuration, and the size of the objects
relative to the machine size. The Connection Machine Scientific
Software Library is a scalable library for distributed data
structures. The library is designed for languages with an array
syntax. It is accessible from all supported languages ($\ast$Lisp,
C$\ast$, CM-Fortran, and Paris (PARallel Instruction Set) in
combination with Lisp, C, and Fortran 77). Single library calls
can manage both concurrent application of a function to disjoint
array segments, as well as concurrency in each application of a
function. The control of the concurrency is independent of the
control constructs provided in the high-level languages. Library
functions operate efficiently on any distributed data structure
that can be defined in the high-level languages and associated
directives. Routines may use their own internal data distribution
for efficiency reasons. The algorithm invoked by a call to a
library function depends upon the shapes of the objects involved,
their sizes and distribution, and upon the machine shape and size.
%R TR-19-92
%D 1992
%T Lessons from a Restricted Turing Test
%A Stuart M. Shieber
%X We report on the recent Loebner prize competition inspired by
Turing's test of intelligent behavior. The presentation covers
the structure of the competition and the outcome of its first
instantiation in an actual event, and an analysis of the purpose,
design, and appropriateness of such a competition. We argue that
the competition has no clear purpose, that its design prevents any
useful outcome, and that such a competition is inappropriate given
the current level of technology. We then speculate as to suitable
alternatives to the Loebner prize.
%R TR-20-92
%D 1992
%T An Efficient Algorithm for Gray-to-Binary Permutation\\ on
Hypercubes
%A Ching-Tien Ho
%A M.T. Raghunath
%A S. Lennart Johnsson
%X Both Gray code and binary code are frequently used in mapping
arrays into hypercube architectures. While the former is
preferred when communication between adjacent array elements is
needed, the latter is preferred for FFT-type communication. When
different phases of computations have different types of
communication patterns, the need arises to remap the data. We
give a nearly optimal algorithm for permuting data from a Gray
code mapping to a binary code mapping on a hypercube with
communication restricted to one input and one output channel per
node at a time. Our algorithm improves over the best previously
known algorithm [6] by nearly a factor of two and is optimal to
within a factor of \ $ n/(n-1) $ \ with respect to data transfer
time on an \ $n$-cube. The expected speedup is confirmed by
measurements on an Intel iPSC/2 hypercube.
%R TR-21-92
%D 1992
%T Physically Realistic Trajectory Planning in Animation:\\ A
Stimulus-Response Approach
%A J. Thomas Ngo
%A Joe Marks
%X Trajectory-planning problems arise in animation when objects must
move subject to physical law and other constraints on their motion.
Witkin and Kass dubbed this class of problems ``Spacetime
Constraints'' (SC) and presented results for specific problems
involving an articulated figure. SC problems are typically
multimodal and discontinuous, and the number of decision
alternatives available at each time step can be large, making it
difficult to construct even coarse trajectories for subsequent
optimization without directive input from the user. Rather than use
a time-domain formulation, which might be appropriate for local
optimization, our algorithm uses a stimulus-response model.
Locomotive skills are encoded as stimulus-response parameters, and
our algorithm, which chooses stimulus-response parameters using a
parallel genetic algorithm, succeeds in finding good novel
solutions for a test suite of SC problems involving unbranched
articulated figures.
%R TR-22-92
%D 1992
%T Pronouns, Names, and the Centering of Attention in Discourse
%A Peter C. Gordon
%A Barbara J. Grosz
%A Laura A. Gilliom
%X Centering theory, developed within computational linguistics,
provides an account of ways in which patterns of inter-utterance
reference can promote the local coherence of discourse. It states
that each utterance in a coherent discourse segment contains a
single semantic entity -- the backward-looking center -- that
provides a link to the previous utterance, and an ordered set of
entities -- the forward-looking centers -- that offer potential
links to the next utterance. We report five reading-time
experiments that test predictions of this theory with respect to
the conditions under which it is preferable to realize (refer to)
an entity using a pronoun rather than a repeated definite
description or name. The experiments show that there is a single
backward-looking center that is preferentially realized as a
pronoun, and that the backward-looking center is typically
realized as the grammatical subject of the utterance. They also
provide evidence that there is a set of forward-looking centers
that is ranked in terms of prominence and that a key factor in
determining prominence, surface-initial position, does not affect
determination of the backward-looking center. This provides
evidence for the dissociation of the coherence processes of
looking backward and looking forward.
%R TR-23-92
%D 1992
%T Communication Primitives for Unstructured Finite Element
Simulations on Data Parallel Architectures
%A Kapil K. Mathur
%A S. Lennart Johnsson
%X Efficient data motion is critical for high performance computing
on distributed memory architectures. The value of some techniques
for efficient data motion is illustrated by identifying generic
communication primitives. Further, the efficiency of these
primitives is demonstrated on three different applications using
the finite element method for unstructured grids and sparse
solvers with different communication requirements. For the
applications presented, the techniques advocated reduced the
communication times by a factor of 1.5 to 3.
%R TR-24-92
%D 1992
%T A Combining Mechanism for Parallel Computers
%A Leslie G. Valiant
%X In a multiprocessor computer communication among the components
may be based either on a simple router, which delivers messages
point-to-point like a mail service, or on a more elaborate
combining network that, in return for a greater investment in
hardware, can combine messages to the same address prior to
delivery. This paper describes a mechanism for recirculating
messages in a simple router so that the added functionality of a
combining network, for arbitrary access patterns, can be achieved
by it with reasonable efficiency. The method brings together the
messages with the same destination address in more than one stage,
and at a set of components that is determined by a hash function
and decreases in number at each stage.
%R TR-25-92
%D 1992
%T Labeling Point Features on Maps and Diagrams
%A Jon Christensen
%A Joe Marks
%A Stuart Shieber
%X A major factor affecting the clarity of graphical displays that
include text labels is the degree to which labels obscure display
features (including other labels) as a result of spatial overlap.
Point-feature label placement (PFLP) is the problem of placing
text labels adjacent to point features on a map or diagram so as
to maximize legibility. This problem occurs frequently in the
production of many types of informational graphics, though it
arises most often in automated cartography. In this paper we
present a comprehensive treatment of the PFLP problem, viewed as a
type of combinatorial optimization problem. Complexity analysis
reveals that the basic PFLP problem and most interesting variants
of it are NP-hard. We review these results briefly. These
negative results help inform a survey of previously reported
algorithms for PFLP; not surprisingly, all such algorithms either
have exponential time complexity or are incomplete. To solve the
PFLP problem in practice, then, we must rely on good heuristic
methods. We propose two new methods, one based on a discrete form
of gradient descent, the other on simulated annealing, and report
on a series of empirical tests comparing these and the other known
algorithms for the problem. Our results indicate that the
simulated annealing approach to PFLP is superior to all existing
methods, regardless of label density.
%R TR-26-92
%D 1992
%T Why BSP Computers?
%A L.G. Valiant
%R TR-27-92
%D 1992
%T Variations on Incremental Interpretation
%A Stuart M. Shieber
%A Mark Johnson
%R TR-28-92
%D 1992
%T A Fixpoint Theory of Nonmonotonic Functions and \\ Its
Applications to Logic Programs, Deductive Databases\\ and
Production Rule Systems
%A Yuli Zhou
%X In this thesis we shall employ denotational (fixpoint) methods to
study the computations of rule systems based on first order logic.
The resulting theory parallels and further strengthens the
fixpoint theory of {\it stratified logic programs} developed by
Apt, Blair and Walker, and we shall consider two principal
applications of the theory to logic programs and to production
rule systems.
%R TR-29-92
%D 1992
%T Massively Parallel Computing: Data distribution and\\
communication
%A S. Lennart Johnsson
%X We discuss some techniques for preserving locality of reference in
index spaces when mapped to memory units in a distributed memory
architecture. In particular, we discuss the use of
multidimensional address spaces instead of linearized address
spaces, partitioning of irregular grids, and placement of
partitions among nodes. We also discuss a set of communication
primitives we have found very useful on the Connection Machine
systems in implementing scientific and engineering applications.
We briefly review some of the techniques used to fully utilize the
bandwidth of the binary cube network of the CM-2 and CM-200, and
give some performance data from implementations of communication
primitives.
%R TR-30-92
%D 1992
%T Optimal Computing on Mesh-Connected Processor Arrays
%A Christos Ioannis Kaklamanis
%X In this thesis, we present and analyze new algorithms for routing,
sorting and dynamic searching on mesh-connected arrays of
processors; also we present a lower bound concerning embeddings on
faulty arrays. In particular, we first consider the problem of
permutation routing in two- and three-dimensional mesh-connected
processor arrays. We present new on-line and off-line routing
algorithms, all of which are optimal to within a small additive
term. Then, we show that sorting an input of size $N = n^2$ can be
performed by an $n \times n$ mesh-connected processor array in $2n
+ o(n)$ parallel communication steps and using constant-size
queues, with high probability. This result is optimal to within a
low order additive term, realizing the obvious diameter lower
bound. Our techniques can be applied to higher dimensional meshes
as well as torus-connected networks, achieving significantly
better bounds than the known results. Furthermore, we investigate
the parallel complexity of the backtrack and branch-and-bound
search on the mesh-connected array. We present an $\Omega(\sqrt
{dN} /\sqrt {\log N} )$ lower bound for the time needed by a {\em
randomized} algorithm to perform backtrack and branch-and-bound
search of a tree of depth $d$ on the $\sqrt {N} \times \sqrt {N} $
mesh, even when the depth of the tree is known in advance. For
the upper bounds we give {\em deterministic} algorithms that are
within a factor of $O(\log^{3/2} N)$ of our lower
bound. Our algorithms do not make any assumption on the shape of
the tree to be searched. Our algorithm for branch-and-bound is
the first algorithm that performs branch-and-bound search on a
sparse network. Both the lower and the upper bounds extend to
higher dimension meshes.
%R TR-31-92
%D 1992
%T An Algebraic Approach to the Compilation and Operational\\
Semantics of Functional Languages with I-structures
%A Zena Matilde Ariola
%X Modern languages are too complex to be given direct operational
semantics. For example, the operational semantics of functional
languages has traditionally been given by translating them to the
$\lambda$-calculus extended with constants. Compilers do a
similar translation into an intermediate form in the process of
generating code for a machine. A compiler then performs
optimizations on this intermediate form before generating machine
code. In this thesis we show that the intermediate form can
actually be the kernel language. In fact, we may translate the
kernel language into still lower-level language(s), where more
machine-oriented or efficiency-related concerns can be expressed
directly. Furthermore, compiler optimizations may be expressed as
source-to-source transformations on the intermediate languages.
We introduce two implicitly parallel languages, Kid (Kernel Id)
and P-TAC (Parallel Three Address Code), respectively, and
describe the compilation process of Id in terms of a translation
of Id into Kid, and of Kid into P-TAC\@. In this thesis we do not
describe the compilation process below the P-TAC level. However,
we show that our compilation process allows the formalization of
questions related to the correctness of the optimizations. We
also give the operational semantics of Id indirectly by its
translation into Kid and a well-defined operational semantics for
Kid. Kid and P-TAC are examples of Graph Rewriting Systems
(GRSs), which are introduced to capture sharing of computation
precisely. Sharing of subexpressions is important both
semantically (e.g., to model side-effects) and pragmatically
(e.g., to reason about complexity). Our GRSs extend Barendregt's
Term Graph Rewriting Systems to include cyclic graphs and cyclic
rules. We present a term model for GRSs along the lines of
L\'evy's term model for $\lambda$-calculus, and show its
application to compiler optimizations. We also show that GRS
reduction is a correct implementation of term rewriting.
%R TR-01-93
%D 1993
%T Massively Parallel Computing: \\ Mathematics and communications
libraries
%A S. Lennart Johnsson
%A Kapil K. Mathur
%X Massively parallel computing holds the promise of extreme
performance. The utility of these systems will depend heavily upon
the availability of libraries until compilation and run-time
system technology is developed to a level comparable to what today
is common on most uniprocessor systems. Critical for performance
is the ability to exploit locality of reference and effective
management of the communication resources. We discuss some
techniques for preserving locality of reference in distributed
memory architectures. In particular, we discuss the benefits of
multidimensional address spaces instead of the conventional
linearized address spaces, partitioning of irregular grids, and
placement of partitions among nodes. Some of these techniques are
supported as language directives, others as run-time system
functions, and others still are part of the Connection Machine
Scientific Software Library, CMSSL. We briefly discuss some of
the unique design issues in this library for distributed memory
architectures, and some of the novel ideas with respect to
managing data allocation, and automatic selection of algorithms
with respect to performance. The CMSSL also includes a set of
communication primitives we have found very useful on the
Connection Machine systems in implementing scientific and
engineering applications. We briefly review some of the
techniques used to fully utilize the bandwidth of the binary cube
network of the CM-2 and CM-200 Connection Machine systems.
%R TR-02-93
%D 1993
%T All-to-All Communication on the Connection Machine CM-200
%A Kapil K. Mathur
%A S. Lennart Johnsson
%X Detailed algorithms for all-to-all broadcast and reduction are
given for arrays mapped by binary or binary-reflected Gray code
encoding to the processing nodes of binary cube networks.
Algorithms are also given for the local computation of the array
indices for the communicated data, thereby reducing the demand for
communications bandwidth. For the Connection Machine system
CM-200, Hamiltonian cycle based all-to-all communication algorithms
yield a performance that is a factor of two to ten higher than
the performance offered by algorithms based on trees, butterfly
networks, or the Connection Machine router. The peak data rate
achieved for all-to-all broadcast on a 2048 node Connection
Machine system CM-200 is 5.4 Gbytes/sec when no reordering is
required. If the time for data reordering is included, then the
effective peak data rate is reduced to 2.5 Gbytes/sec.
%R TR-03-93
%D 1993
%T Topics in Parallel and Distributed Computation
%A Alexandros Gerbessiotis
%X With advances in communication technology, the introduction of
multiple-instruction multiple-data parallel computers and the
increasing interest in neural networks, the fields of parallel and
distributed computation have received increasing attention in
recent years. We study in this work the bulk-synchronous parallel
model, which attempts to bridge the software and hardware worlds
with respect to parallel computing. It offers a high level
abstraction of the hardware with the purpose of allowing parallel
programs to run efficiently on diverse hardware platforms. We
examine direct algorithms on this model and also give simulations
of other models of parallel computation on this one as well as on
models that bypass it. While the term parallel computation refers
to the execution of a single or a set of closely coupled tasks by
a set of processors, the term distributed computation refers to
more loosely coupled or uncoupled tasks being executed at
different locations. In a distributed computing environment it is
sometimes necessary that one computer send to the remaining ones
various pieces of information. The term broadcasting is used to
describe the dissemination of information from one computer to the
others in such an environment. We examine various classes of
random graphs with respect to broadcasting and establish results
related to the minimum time required to perform broadcasting from
any vertex of such graphs. Various models of the human brain, as
a collection of distributed elements working in parallel, have
been proposed. Such elements are connected together in a network.
The network in the human brain is sparse. How such sparse networks
of simple elements can perform any useful computation is a topic
currently little understood. We examine in this work a graph
construction problem as it relates to neuron allocation in a model
of neural networks recently proposed. We also examine a certain
class of random graphs with respect to this problem and establish
various results related to the distribution of the sizes of sets
of neurons when various learning tasks are performed on this
model. Experimental results are also presented and compared to
theoretically derived ones.
%R TR-04-93
%D 1993
%T Infrastructure for Research towards Ubiquitous Information\\
Systems
%A Barbara Grosz
%A H.T. Kung
%A Margo Seltzer
%A Stuart Shieber
%A Michael Smith
%R TR-05-93
%D 1993
%T An Equational Framework for the Flow Analysis of Higher Order
Functional Programs
%A Dan Stefanescu
%A Yuli Zhou
%X This paper presents a novel technique for the static analysis of
functional programs. The method uses the Cousots' original
framework, extended with a syntax-based abstraction methodology.
The main idea is to represent each computational entity in a
functional program in relation to its (concrete) call string, i.e.
the string of function calls leading to its computation.
Furthermore, the abstraction criteria consists in choosing a
relation of equivalence over the set of all call strings. Based on
this relation of equivalence, the method generates a monotonic
system of equations such that its least solution is the desired
result of the analysis. This approach generalizes previous
techniques (0CFA, 1CFA, etc.) in flow analyses and allows for
program-directed design of frameworks for approximate analysis of
programs. The method is proven correct with respect to a rewriting
system based operational semantics.
%R TR-06-93
%D 1993
%T An Efficient Communication Strategy for Finite Element\\ Methods
on the Connection Machine CM-5 System
%A Zden\v{e}k Johan
%A Kapil K. Mathur
%A S. Lennart Johnsson
%A Thomas J.R. Hughes
%X Performance of finite element solvers on parallel computers such
as the Connection Machine CM-5 system is directly related to the
efficiency of the communication strategy. The objective of this
work is two-fold: First, we propose a data-parallel implementation
of a partitioning algorithm used to decompose unstructured meshes.
The mesh partitions are then mapped to the vector units of the CM-
5. Second, we design gather and scatter operations taking
advantage of data locality coming from the decomposition to reduce
the communication time. This new communication strategy is
available in the CMSSL [8]. An example illustrates the performance
of the proposed strategy.
%R TR-07-93
%D 1993
%T All-to-All Communication Algorithms for Distributed BLAS
%A Kapil K. Mathur
%A S. Lennart Johnsson
%X Dense Distributed Basic Linear Algebra Subroutine (DBLAS)
algorithms based on all-to-all broadcast and all-to-all reduce are
presented. For DBLAS, at each all-to-all step, it is necessary to
know the data values and the indices of the data values as well.
This is in contrast to the more traditional applications of all-to
-all broadcast (such as an N-body solver) where the identity of the
data values is not of much interest. Detailed schedules for all-
to-all broadcast and reduction are given for the data motion of
arrays mapped to the processing nodes of binary cube networks
using binary encoding and binary-reflected Gray encoding. The
algorithms compute the indices for the communicated data locally.
No communication bandwidth is consumed for data array indices.
For the Connection Machine system CM-200, Hamiltonian cycle based
all-to-all communication algorithms improve the performance by a
factor of two to ten over a combination of tree, butterfly
network, and router based algorithms. The data rate achieved for
all-to-all broadcast on a 256-node Connection Machine system CM-
200 is 0.3 Gbytes/s. The data motion rate for all-to-all
broadcast, including the time for index computations and local
data reordering, is about 2.8 Gbytes/s for a 2048-node system.
Excluding the time for index computation and local memory
reordering, the measured data motion rate for all-to-all broadcast
is 5.6 Gbytes/s. On a Connection Machine system CM-200 with 2048
processing nodes, the overall performance of the distributed
matrix vector multiply (DGEMV) and vector matrix multiply (DGEMV
with TRANS) is 10.5 Gflops/s and 13.7 Gflops/s, respectively.
%R TR-08-93
%D 1993
%T Massively Parallel Computing: \\ Unstructured Finite Element
Simulations
%A Kapil K. Mathur
%A Zden\v{e}k Johan
%A S. Lennart Johnsson
%A Thomas J.R. Hughes
%X Massively parallel computing holds the promise of extreme
performance. Critical for achieving high performance is the
ability to exploit locality of reference and effective management
of the communication resources. This article describes two
communication primitives and associated mapping strategies that
have been used for several different unstructured, three-
dimensional, finite element applications in computational fluid
dynamics and structural mechanics.
%R TR-09-93
%D 1993
%T The Connection Machine System CM-5
%A S. Lennart Johnsson
%X The Connection Machine system CM-5 is a parallel computing system
scalable to Tflop performance, hundreds of Gbytes of primary
storage, Tbytes of secondary storage and Gbytes/s of I/O
bandwidth. The system has been designed to be scalable over a
range of up to three orders of magnitude. We will discuss the
design goals, innovative software and hardware features of the CM-
5 system, and some experience with the system.
%R TR-10-93
%D 1993
%T Constraint-Driven Diagram Layout
%A Ed Dengler
%A Mark Friedell
%A Joe Marks
%X Taking both perceptual organization and aesthetic criteria into
account is the key to high-quality diagram layout, but makes for a
more difficult problem than pure aesthetic layout. Computing the
layout of a network diagram that exhibits a specified perceptual
organization can be phrased as a constraint-satisfaction problem.
Some constraints are derived from the perceptual-organization
specification: the nodes in the diagram must be positioned so that
they form specified perceptual gestalts, i.e., certain groups of
nodes must form perceptual groupings by proximity, or symmetry, or
shape motif, etc. Additional constraints are derived from
aesthetic considerations: the layout should satisfy criteria that
concern the number of link crossings, the sum of link lengths, or
diagram area, etc. Using a generalization of a simple mass-spring
layout technique to ``satisfice'' constraints, we show how to
produce high-quality layouts with specified perceptual
organization for medium-sized diagrams (10--30 nodes) in under 30
seconds on a workstation.
%X tr-10-93.ps.gz
%R TR-11-93
%D 1993
%T An Efficient Communication Strategy for Finite Element\\ Methods
on the Connection Machine CM-5 System
%A Zden\v{e}k Johan
%A Kapil K. Mathur
%A S. Lennart Johnsson
%A Thomas J.R. Hughes
%X The objective of this paper is to propose communication procedures
suitable for unstructured finite element solvers implemented on
distributed-memory parallel computers such as the Connection
Machine CM-5 system. First, a data-parallel implementation of the
recursive spectral bisection (RSB) algorithm proposed by Pothen
{\em et al.} is presented. The RSB algorithm is associated with a
node renumbering scheme which improves data locality of reference.
Two-step gather and scatter operations taking advantage of this
data locality are then designed. These communication primitives
make use of the indirect addressing capability of the CM-5 vector
units to achieve high gather and scatter bandwidths. The
efficiency of the proposed communication strategy is illustrated
on large-scale three-dimensional fluid dynamics problems.
%R TR-12-93
%D 1993
%T Aligning Sentences in Bilingual Corpora Using Lexical\\
Information
%A Stanley F. Chen
%X In this paper, we describe a fast algorithm for aligning sentences
with their translations in a bilingual corpus. Existing efficient
algorithms ignore word identities and only consider sentence
length (Brown {\it et al.}, 1991; Gale and Church, 1991). Our
algorithm constructs a simple statistical word-to-word translation
model on the fly during alignment. We find the alignment that
maximizes the probability of generating the corpus with this
translation model. We have achieved an error rate of
approximately 0.4\% on Canadian Hansard data, which is a
significant improvement over previous results. The algorithm is
language independent.
%R TR-13-93
%D 1993
%T Compaction and Separation Algorithms for Non-Convex Polygons and
Their Applications
%A Zhenyu Li
%A Victor Milenkovic
%X Given a two-dimensional, non-overlapping layout of convex and non-
convex polygons, compaction can be thought of as simulating the
motion of the polygons as a result of applied ``forces.'' We apply
compaction to improve the material utilization of an already
tightly packed layout. Compaction can be modeled as a motion of
the polygons that reduces the value of some functional on their
positions. Optimal compaction, planning a motion that reaches a
layout that has the global minimum functional value among all
reachable layouts, is shown to be NP-complete under certain
assumptions. We first present a compaction algorithm based on
existing physical simulation approaches. This algorithm uses a
new velocity-based optimization model. Our experimental results
reveal the limitation of physical simulation: even though our new
model improves the running time of our algorithm over previous
simulation algorithms, the algorithm still cannot compact typical
layouts of one hundred or more polygons in a reasonable amount of
time. The essential difficulty of physics-based models is that
they can only generate velocities for the polygons, and the final
positions must be generated by numerical integration. We present
a new position-based optimization model that allows us to
calculate directly new polygon positions via linear programming
that are at a local minimum of the objective. The new model
yields a translational compaction algorithm that runs two orders
of magnitude faster than physical simulation methods. We also
consider the problem of separating overlapping polygons using a
minimal amount of motion and show it to be NP-complete. Although
this separation problem looks quite different from the compaction
problem, our new model also yields an efficient algorithm to solve
it. The compaction/separation algorithms have been applied to
marker making: the task of packing polygonal pieces on a sheet of
cloth of fixed width so that total length is minimized. The
compaction algorithm has improved cloth utilization of human
generated pants markers. The separation algorithm together with a
database of human-generated markers can be used for automatic
generation of markers that approach human performance.
%X tr-13-93.ps.gz
%R TR-14-93
%D 1993
%T Universal Boolean Judges and Their Characterization
%A Eyal Kushilevitz
%A Silvio Micali
%A Rafail Ostrovsky
%X We consider the classic problem of $n$ honest (but curious)
players with private inputs $x_1,\ldots, x_n$ who wish to compute
the value of some pre-determined function $f(x_1,\ldots,x_n)$, so
that at the end of the protocol every player knows the value of
$f(x_1,\ldots,x_n)$. The players have unbounded computational
resources and they wish to compute $f$ in a totally {\em
private\/} ($n$-private) way. That is, after the completion of the
protocol, which all players honestly follow, no coalition (of
arbitrary size) can infer any information about the private inputs
of the remaining players beyond what has already been revealed by
the value of $f(x_1,\ldots,x_n)$. Of course, with the help of a
{\em trusted judge for computing $f$}, players can trivially
compute $f$ in a totally private manner: every player secretly
gives his input to the trusted judge and she announces the result.
Previous research was directed towards implementing such a judge
``mentally'' by the players themselves, and was shown possible
under various assumptions. Without assumptions, however, it was
shown that most functions {\em cannot\/} be computed in a
totally private manner and thus we must rely on a trusted judge.
If we have a trusted judge for $f$ we are done. Can we use a
judge for a ``simpler'' function $g$ in order to compute $f$ $n$-
privately? In this paper we initiate the study of the {\em
complexity} of such judges needed to achieve total privacy for
arbitrary $f$. We answer the following two questions: {\em How
complicated must such a judge be, compared to $f$?} and {\em
Does there exist some judge that can be used for all $f$?} We
show that there exist {\bf universal boolean} judges (i.e.,
ones that can be used for any $f$) and give a complete
characterization of all the boolean functions that describe
universal judges. In fact, we show that a judge computing {\em
any\/} boolean function $g$ which itself cannot be computed $n$-
privately (i.e., when there is no judge available) is {\em
universal\/} . Thus, we show that for all boolean functions, the
notions of {\bf universality\/} and {\bf $n$-privacy} are {\em
complementary\/}. On the other hand, for non-boolean functions,
we show that these two notions are {\em not\/} complementary. Our
result can be viewed as a strong generalization of the two-party
case, where Oblivious Transfer protocols were shown to be
universal.
%R TR-15-93
%D 1993
%T An Unbundled Compiler
%A Thomas Cheatham
%R TR-16-93
%D 1993
%T Actions, Beliefs and Intentions in Multi-Action Utterances
%A Cecile Tiberghien Balkanski
%X Multi-action utterances convey critical information about agents'
beliefs and intentions with respect to the actions they talk about
or perform. Two such utterances may, for example, describe the
same actions while the speakers of these utterances hold beliefs
about these actions that are diametrically opposed. Hence, for a
language interpretation system to understand multi-action
utterances, it must be able (1) to determine the actions that are
described and the ways in which they are related, and (2) to draw
appropriate inferences about the agents' mental states with
respect to these actions and action relations. This thesis
investigates the semantics of two particular multi-action
constructions: utterances with means clauses and utterances with
rationale clauses. These classes of utterances are of interest
not only as exemplars of multi-action utterances, but also because
of the subtle differences in information that can be felicitously
inferred from their use. Their meaning is shown to depend on the
beliefs and intentions of the speaker and agents whose actions are
being described as well as on the actions themselves. Thus, the
thesis demonstrates (a) that consideration of mental states cannot
be reserved to pragmatics and (b) that other aspects of natural
language interpretation besides the interpretation of mental state
verbs or plan recognition may provide information about mental
states. To account for this aspect of natural language
interpretation, this thesis presents a theory of logical form, a
theory of action and action relations, an axiomatization of belief
and intention, and interpretation rules for means clauses and
rationale clauses. Together these different pieces constitute an
interpretation model that meets the requirements specified in (1)
and (2) above and that predicts the set of beliefs and intentions
shown to be characteristic of utterances with means clauses and
rationale clauses. This model has been implemented in the MAUI
system (Multi-Action Utterance Interpreter), which accepts natural
language sentences from a user, computes their logical form, and
answers questions about the beliefs and intentions of the speaker
and actor regarding the actions and action relations described.
%R TR-17-93
%D 1993
%T Stochastic Approximation Algorithms for Number Partitioning
%A Wheeler Ruml
%X This report summarizes research on algorithms for finding
particularly good solutions to instances of the NP-complete number
-partitioning problem. Our approach is based on stochastic search
algorithms, which iteratively improve randomly chosen initial
solutions. Instead of searching the space of all $2^{n-1}$
possible partitionings, however, we use these algorithms to
manipulate indirect encodings of candidate solutions. An encoded
solution is evaluated by a decoder, which interprets the encoding
as instructions for constructing a partitioning of a given problem
instance. We present several different solution encodings,
including bit strings, permutations, and rule sets, and describe
decoding algorithms for them. Our empirical results show that many
of these encodings restrict and reshape the solution space in ways
that allow relatively generic search methods, such as hill
climbing, simulated annealing, and the genetic algorithm, to find
solutions that are often as good as those produced by the best
known constructive heuristic, and in many cases far superior. For
the algorithms and representations we consider, the choice of
solution representation plays an even greater role in determining
performance than the choice of search algorithm.
%X tr-17-93.ps.gz
%R TR-18-93
%D 1993
%T POLYSHIFT Communications Software for the Connection Machine
System CM-200
%A William George
%A Ralph G. Brickner
%A S. Lennart Johnsson
%X We describe the use and implementation of a polyshift function
{\bf PSHIFT} for circular shifts and end-off shifts. Polyshift is
useful in many scientific codes using regular grids, such as
finite difference codes in several dimensions, multigrid codes,
molecular dynamics computations, and in lattice gauge physics
computations, such as quantum chromodynamics (QCD) calculations.
Our implementation of the {\bf PSHIFT} function on the Connection
Machine systems CM-2 and CM-200 offers a speedup of up to a factor
of 3-4 compared to {\bf CSHIFT} when the local data motion within
a node is small. The {\bf PSHIFT} routine is included in the
Connection Machine Scientific Software Library (CMSSL).
%R TR-19-93
%D 1993
%T High Performance, Scalable Scientific Software Libraries
%A S. Lennart Johnsson
%A Kapil K. Mathur
%X Massively parallel processors introduce new demands on software
systems with respect to performance, scalability, robustness and
portability. The increased complexity of the memory systems and
the increased range of problem sizes for which a given piece of
software is used pose serious challenges for software developers.
The Connection Machine Scientific Software Library, CMSSL, uses
several novel techniques to meet these challenges. The CMSSL
contains routines for managing the data distribution and provides
data distribution independent functionality. High performance is
achieved through careful scheduling of operations and data motion,
and through the automatic selection of algorithms at run-time. We
discuss some of the techniques used, and provide evidence that
CMSSL has reached the goals of performance and scalability for an
important set of applications.
%R TR-20-93
%D 1993
%T A Collaborative Planning Approach to Discourse Understanding
%A Karen Lochbaum
%X Approaches to discourse understanding fall roughly into two
categories: those that treat mental states as elemental and thus
reason directly about them, and those that do not reason about the
beliefs and intentions of agents themselves, but about the
propositions and actions that might be considered objects of those
beliefs and intentions. The first type of approach is a mental
phenomenon approach, the second a data-structure approach
(Pollack, 1986b). In this paper, we present a mental phenomenon
approach to discourse understanding and demonstrate its advantages
over the data-structure approaches used by other researchers. The
model we present is based on the collaborative planning framework
of SharedPlans (Grosz and Sidner, 1990; Lochbaum, Grosz, and
Sidner, 1990; Grosz and Kraus, 1993). SharedPlans are shown to
provide a computationally realizable model of the intentional
component of Grosz and Sidner's theory of discourse structure
(Grosz and Sidner, 1986). Additionally, this model is shown to
simplify and extend approaches to discourse understanding that
introduce multiple types of plans to model an agent's motivations
for producing an utterance (Litman, 1985; Litman and Allen, 1987;
Ramshaw, 1991; Lambert and Carberry, 1991).
%R TR-21-93
%D 1993
%T Evolving Line Drawings
%A Ellie Baker
%A Margo Seltzer
%X This paper explores the application of interactive genetic
algorithms to the creation of line drawings. We have built a
system that starts with a collection of drawings that are either
randomly generated or input by the user. The user selects one such
drawing to mutate or two to mate, and a new generation of drawings
is produced by randomly modifying or combining the selected
drawing(s). This process of selection and procreation is repeated
many times to evolve a drawing. A wide variety of complex
sketches with highlighting and shading can be evolved from very
simple drawings. This technique has enormous potential for
augmenting and enhancing the power of traditional computer-aided
drawing tools, and for expanding the repertoire of the computer-
assisted artist.
%X tr-21-93.ps.gz
%R TR-22-93
%D 1993
%T A Stencil Compiler for the Connection Machine Models\\ CM-2/200
%A Ralph G. Brickner
%A William George
%A S. Lennart Johnsson
%A Alan Ruttenberg
%X In this paper we present a Stencil Compiler for the Connection
Machine Models CM-2 and CM-200. A {\em stencil} is a weighted sum
of circularly-shifted CM Fortran arrays. The stencil compiler
optimizes the data motion between processing nodes, minimizes the
data motion within a node, and minimizes the data motion between
registers and local memory in a node. The compiler makes novel
use of the communication system and has highly optimized register
use. The compiler natively supports two-dimensional stencils, but
stencils in three or four dimensions are automatically decomposed.
Portions of the system are integrated as part of the CM Fortran
programming system, and also as part of the system microcode. The
compiler is available as part of the Connection Machine Scientific
Software Library (CMSSL) Release 3.1.
%R TR-23-93
%D 1993
%T CMSSL: A Scalable Scientific Software Library
%A S. Lennart Johnsson
%X Massively parallel processors introduce new demands on software
systems with respect to performance, scalability, robustness and
portability. The increased complexity of the memory systems and
the increased range of problem sizes for which a given piece of
software is used pose serious challenges for software developers.
The Connection Machine Scientific Software Library, CMSSL, uses
several novel techniques to meet these challenges. The CMSSL
contains routines for managing the data distribution and provides
data distribution independent functionality. High performance is
achieved through careful scheduling of operations and data motion,
and through the automatic selection of algorithms at run-time. We
discuss some of the techniques used, and provide evidence that
CMSSL has reached the goals of performance and scalability for an
important set of applications.
%R TR-24-93
%D 1993
%T Annotating Floor Plans Using Deformable Polygons
%A Kathy Ryall
%A Joe Marks
%A Murray Mazer
%A Stuart Shieber
%X The ability to recognize regions in a bitmap image has
applications in various areas, from document recognition of
scanned building floor plans to processing of scanned forms. We
consider the use of deformable polygons for delineating partially
or fully bounded regions of a scanned bitmap that depicts a
building floor plan. We discuss a semi-automated interactive
system, in which a user positions a seed polygon in an area of
interest in the image. The computer then expands and deforms the
polygon in an attempt to minimize an energy function that is
defined so that configurations with minimum energy tend to match
the subjective boundaries of regions in the image. When the
deformation process is completed, the user may edit the deformed
polygon to make it conform more closely to the desired region. In
contrast to area-filling techniques for delineating areal regions
of images, our approach works robustly for partially bounded
regions.
%X tr-24-93.ps.gz