# Harvard Computer Science Technical Reports for 1991

TR-01-91 [tr-01-91.ps.gz (66 K)]
Ralph G. Brickner, Clive F. Baillie, and S. Lennart Johnsson. 1991. QCD on the Connection Machine: Beyond $^*\hbox{LISP}$.''
We report on the status of code development for a simulation of Quantun Chromodynamics (QCD) with dynamical Wilson fermions on the Connection Machine model CM-2. Our original code, written in \*Lisp, gave performance in the near -GFLOPS range. We have rewritten the most time-consuming parts of the code in the low-level programming system CMIS, including the matrix multiply and the communication. Current versions of the code run at approximately 3.6 GFLOPS for the fermion matrix inversion, and we expect the next version to reach or exceed 5 GFLOPS.
TR-02-91 [tr-02-91.ps.gz (37 K)]
S. Lennart Johnsson and Patrick Worley. 1991. Communication and I/O Libraries.''

TR-03-91
S. Lennart Johnsson and Ching-Tien Ho. 1991. Optimal Communication Channel Utilization for Matrix Transposition and Related Permutations on Boolean Cubes .''
TR-16-92 SUPERCEDES TR-03-91
TR-04-91 [tr-04-91.ps.gz (99 K)]
S. Lennart Johnsson and Ching-Tien Ho. 1991. Generalized Shuffle Permutations on Boolean Cubes.''
In a {\em generalized shuffle permutation} an address \ $(a_ {q- 1} -a_ {q-2} ldots a_ {} )$ \ receives its content from an address obtained through a cyclic shift on a subset of the \ $q$ \ dimensions used for the encoding of the addresses. Bit- complementation may be combined with the shift. We give an algorithm that requires \ $\frac {K} {2} + 2$ \ exchanges for \ $K$ \ elements per processor, when storage dimensions are part of the permutation, and concurrent communication on all ports of every processor is possible. The number of element exchanges in sequence is independent of the number of processor dimensions \ $\sigma_ {r}$ \ in the permutation. With no storage dimensions in the permutation our best algorithm requires \ $(\sigma_ {r} + 1) lceil \frac {K} {2 \sigma_ {r}} \rceil$ \ element exchanges. We also give an algorithm for \ $\sigma_ {r} = 2,$ \ or the real shuffle consists of a number of cycles of length two, that requires \ $\frac {K} {2} + 1$ \ element exchanges in sequence when there no bit complement. The lower bound is \ $\frac {K} {2}$ \ for both real and mixed shuffles with no bit complementation. The minimum number of communication start-ups is \ $\sigma_ {r}$ \ for both cases, which is also the lower bound. The data transfer time for communication restricted to one port per processor is \ $\sigma_ {r} \frac {K} {2} ,$ \ and the minimum number of start-ups is \ $\sigma_ {r} .$ \ The analysis is verified by experimental results on the Intel iPSC/1, and for one case also on the Connection Machine.
TR-05-91 [tr-05-91.ps.gz (88 K)]
Joe Marks and Stuart Shieber. 1991. The Computational Complexity of Cartographic Label Placement.''
We examine the computational complexity of cartographic label placement, a problem derived from the cartographer's task of placing text labels adjacent to map features in such a way as to minimize overlaps with other labels and map features. Cartographic label placement is one of the most time-consuming tasks in the production of maps. Consequently, several attempts have been made to automate the label-placement task for some or all classes of cartographic features (punctual, linear, or areal features), but all previously published algorithms for the most basic task---point-feature-label placement---either exhibit worst- case exponential time complexity, or incorporate incomplete heuristics that may fail to find an admissible labeling even when one exists. The computational complexity of label placement is therefore a matter of practical significance in automated cartography. We show that admissible label placement is NP- complete, even for very simple versions of the problem. Thus, no polynomial time algorithm exists unless $P=NP$. Similarly, we show that optimal label placement can be solved in polynomial time if and only if $P=NP$, and this result holds even if we require only approximately optimal placements. The results are especially interesting because cartographic label placement is one of the few combinatorial problems that remains NP-hard even under a geometric (Euclidean) interpretation. The results are of broader practical significance, as they also apply to point-feature labeling in non- cartographic displays, e.g., the labeling of points in a scatter plot.
TR-06-91
Steven John Sistare. 1991. A Graphical Editor for Three-Dimensional Constraint-Based Geometric Modeling.''
The design of geometric models can be a painstaking and time- consuming task. Typical CAD packages are often primitive or deficient in the means they offer for placing geometry in a design or for subsequently modifying the geometry. Modification in particular can tax a user's patience when it requires many deletion, creation, or perturbation operations to effect a conceptually simple change in the design. One area of research that attempts to address these deficiencies involves the use of constraints on the geometry as a means of both specifying and controlling its shape. The form in which the constraint information is demanded from the user determines the ease of use of any geometric system that is based on constraints, and is one of the basic problems to be addressed in the design of such a system. I present a constraint-based geometric editor that allows the manipulation of both constraints and geometry using the direct -manipulation paradigm, which is well established as being of central importance in many easy-to-use systems. When using the editor, constraints are presented to the user graphically, in the context of the geometric design, and may be created, destroyed, and manipulated interactively along with the geometry. The constraints may either be created explicitly, or implicitly as a side effect of creating geometry. In addition, constraints are used in a novel way to facilitate interactive creation and positioning of geometry in three space, despite the limitations of commonly-available two-dimensional display and input devices. Lastly, whenever geometry is modified using direct manipulation, a solver is called which updates the geometry in accordance with the existing constraints. All of these features contribute to the ease of use of my system. I also present a solver that addresses another basic problem inherent in constraint-based systems; namely, the need to efficiently obtain a solution that instantiates the geometry so as to satisfy the constraints. I provide a robust and efficient solver that is $O(n^2)$ in the size of the geometric problem being solved and present the mathematics that support it. In addition, I present a new algorithm that partitions the free-form geometry and constraint network into a number of pieces that may be independently solved. For many networks, this algorithm yields a solution to the entire network in close to linear time.
TR-07-91
Alfredo De Santis and Giuseppe Persiano. 1991. Tight Upper and Lower Bounds on the Path Length of Binary Trees.''
The {\em external path length} of a tree $T$ is the sum of the lengths of the paths from the root to each external node. The {\em maximal path length difference,} $riangle$, is the difference between the length of the longest and shortest such path. We prove tight lower and upper bounds on the external path length of binary trees with $N$ external nodes and prescribed maximal path length difference $riangle$. In particular, we give an upper bound that, for each value of $riangle$, can be exactly achieved for infinitely many values of $N$. This improves on the previously known upper bound that could only be achieved up to a factor proportional $N$. We then use the upper bound to give a simple upper bound on the path length of Red-Black trees which is asymptotically tight. We also present, as a preliminary result, an elementary proof of the known upper bound. We finally prove a lower bound which can be exactly achieved for each value of $N$ and $riangle leq N/2$.
TR-08-91
Robert Muller and Yuli Zhou. 1991. Abstract Semantics of First-Order Recursive Schemes.''
We develop a general framework for deriving abstract domains from concrete semantic domains in the context of first-order recursive schemes and prove several theorems which ensure the correctness (safety) of abstract computations. The abstract domains, which we call {\em Weak Hoare powerdomains} , subsume the roles of both the abstract domains and the collecting interpretations in the abstract interpretation literature.
TR-09-91
Robert Muller and Yuli Zhou. 1991. Semantic Domains for Abstract Interpretation.''
In this paper we consider abstract interpretation of PCF programs. The main development is the extension of {\em weak powerdomains} to higher types. In the classical abstract interpretation approach, abstract domains are constructed explicitly and the abstract semantics is then related to the concrete semantics. In the approach introduced here, abstract domains are {\em derived} directly from concrete domains. The conditions for deriving the domains are intended to be as general as possible while still guaranteeing that the derived domain has sufficient structure so that it can be used as a basis for computing correct information about the concrete semantics. We prove three main theorems, the last of which ensures the correctness of abstract interpretation of PCF programs given safe interpretations of the constants. This generalizes earlier results obtained for the special case of strictness analysis.
TR-10-91 [tr-10-91.ps.gz (82 K)]
S. Lennart Johnsson. 1991. Performance Modeling of Distributed Memory Architectures.''
We provide performance models for several primitive operations on data structures distributed over memory units interconnected by a Boolean cube network. In particular, we model single source, and multiple source concurrent broadcasting or reduction, concurrent gather and scatter operations, shifts along several axes of multi- dimensional arrays, and emulation of butterfly networks. We also show how the processor configuration, data aggregation, and the encoding of the address space affect the performance for two important basic computations: the multiplication of arbitrarily shaped matrices, and the Fast Fourier Transform. We also give an example of the performance behavior for local matrix operations for a processor with a single path to local memory, and a set of registers. The analytic models are verified by measurements on the Connection Machine model CM-2.
TR-11-91
Cesar Galindo-Legaria and Arnon Rosenthal. 1991. Outerjoins---How to Extend a Conventional Optimizer.''
Free choice among join orderings is one of the most powerful optimizations in a conventional optimizer. But the freedom is limited to Select/Project/Join queries. In this paper, we extend this freedom to queries that include outerjoins. Unlike previous work, these results are not limited to queries possessing a nice structure,'' or queries that are nicely represented in relational calculus. Our theoretical results concern query simplification'' and reassociation using a generalized outerjoin. We show how the necessary computation can be added rather easily to the join-order generation of a conventional query optimizer.
TR-12-91
Karen E. Lochbaum. 1991. An Algorithm for Plan Recognition in Collaborative Discourse.''
A model of plan recognition in discourse must be based on intended recognition, distinguish each agent's beliefs and intentions from the other's, and avoid assumptions about the correctness or completeness of the agents' beliefs. In this paper, we present an algorithm for plan recognition that is based on the SharedPlan model of collaboration and that satisfies these constraints.
TR-13-91
Michael F. Kilian. 1991. Object-Oriented Programming for Massively Parallel Machines.''
Large, robust massively parallel programs that are understandable (and therefore maintainable) are not yet a reality. Such programs require a programming methodology that minimizes the conceptual differences between the program and the domain addressed by the program, encourages reusability, and still produces robust programs that are readily maintained and reasoned about. This paper proposes the parallel object-oriented model. The model is constructed from an object-oriented methodology augmented by constructs and semantics for parallel processing, and satisfies the requirements for building large parallel applications. It presents a unique way of representing object references and of managing concurrent access to objects. The methodology may be extended for a wide range of computing platforms and application areas.
TR-14-91
Karen E. Lochbaum. 1991. Plan Recognition in Collaborative Discourse.''
A model of plan recognition in discourse must be based on intended recognition, distinguish each agent's beliefs and intentions from the other's, and avoid assumptions about the correctness or completeness of the agents' beliefs. In this paper, we present an algorithm for plan recognition that is based on the SharedPlan model of collaboration and that satisfies these constraints.
TR-15-91
Michael Mitzenmacher. 1991. Elliptic Curves in Computer Science: Primality Testing, Factoring, and Cryptography.''

TR-16-91
Thomas R. Hancock. 1991. Identifiability is Closed under Embeddings in Read-Once Formulas or $\mu$-Decision Trees.''
We show a general positive result that allows us to boost the expressiveness of projection closed classes of boolean functions that are identifiable with membership and equivalence queries. Such classes include monotone DNF formulas, read-once formulas, conjunctions of Horn clauses, switch configurations, $\mu$-formula decision trees, and read-twice DNF formulas. We show that when representations from such classes (rather than single literals) are tested at the leaves of a read-once formula, or on the internal nodes of a $\mu$-decision tree, the resulting representation class is still identifiable with membership and equivalence queries. The additional overhead in time and queries is polynomial.
TR-17-91 [tr-17-91.ps.gz (61 K)]
J. Thomas Ngo and Joe Marks. 1991. Computational Complexity of a Problem in Molecular Structure Prediction.''
The computational task of protein-structure prediction is believed to require exponential time, but previous arguments as to its intractability have taken into account only the size of a protein's conformational space. Such arguments do not rule out the possible existence of an algorithm, more selective than exhaustive search, that is efficient and exact. (An {\em efficient} algorithm is one that is guaranteed, for all possible inputs, to run in time bounded by a function polynomial in the problem size. An {\em intractable} problem is one for which no efficient algorithm exists.) Questions regarding the possible intractability of problems are often best answered using the theory of NP-completeness. In this treatment we show the NP- hardness of two typical mathematical statements of empirical potential energy function minimization for macromolecules. Unless all NP-complete problems can be solved efficiently, these results imply that a function-minimization algorithm can be efficient for protein-structure prediction only if it exploits protein-specific properties that prohibit the simple geometric constructions that we use in our proofs. Analysis of further mathematical statements of molecular structure prediction could constitute a systematic methodology for identifying sources of complexity in protein folding, and for guiding development of predictive algorithms.
TR-18-91 [tr-18-91.ps.gz (60 K)]
S. Lennart Johnsson and Ching-Tien Ho. 1991. Optimal All-to-All Personalized Communication with Minimum Span on Boolean Cubes.''
All-to-all personalized communication is a class of permutations in which each processor sends a unique message to every other processor. We present optimal algorithms for concurrent communication on all channels in Boolean cube networks, both for the case with a single permutation, and the case where multiple permutations shall be performed on the same local data set, but on different sets of processors. For \ $K$ \ elements per processor our algorithms give the optimal number of elements transfer, \ $K/2.$ \ For a succession of all-to-all personalized communications on disjoint subcubes of \ $eta$ \ dimensions each, our best algorithm yields \ $\frac {K} {2} + \sigma - eta$ \ element exchanges in sequence, where \ $\sigma$ \ is the total number of processor dimensions in the permutation. An implementation on the Connection Machine of one of the algorithms offers a maximum speed-up of 50% compared to the previously best known algorithm.
TR-19-91 [tr-19-91.ps.gz (57 K)]
Ching-Tien Ho, S. Lennart Johnsson, and Alan Edelman. 1991. Matrix Multiplication on Hypercubes Using Full Bandwidth and Constant Storage.''
For matrix multiplication on hypercube multiprocessors with the product matrix accumulated in place a processor must receive about \ $P^ {2} /\sqrt {N}$ \ elements or each input operand, with operands of size \ $P imes P$ \ distributed evenly over \ $N$ \ processors. With concurrent communication on all ports, the number of element transfers in sequence can be reduced to \ $P^ {2} /\sqrt {N} log N$ \ for each input operand. We present a two-level partitioning of the matrices and an algorithm for the matrix multiplication with optimal data motion and constant storage. The algorithm has sequential arithmetic complexity \ $2P^ {3} ,$ \ and parallel arithmetic complexity \ $2P^ {3} /N.$ \ The algorithm has been implemented on the Connection Machine model CM-2. For the performance on the 8K CM-2, we measured about 1.6 Gflops, which would scale up to about 13 Gflops for a 64K full machine.
TR-20-91 [tr-20-91.ps.gz (69 K)]
S. Lennart Johnsson and Ching-Tien Ho. 1991. On the Conversion between Binary Code and Binary-Reflected Gray Code on Boolean Cubes.''
We present a new algorithm for conversion between binary code and binary-reflected Gray code that requires approximately \ $\frac {2K} {3}$ \ element transfers in sequence for \ $K$ \ elements per node, compared to \ $K$ \ element transfers for previously known algorithms. For a Boolean cube of \ $n = 2$ \ dimensions the new algorithm degenerates to yield a complexity of \ $\frac {K} {2} + 1$ \ element transfers, which is optimal. The new algorithm is optimal within a factor of \ $\frac {1} {3}$ \ for any routing strategy. We show that the minimum number of element transfers for minimum path length routing is \ $K$ \ with concurrent communication on all channels of every node of a Boolean cube.
TR-21-91 [tr-21-91.ps.gz (81 K)]
Jean-Philippe Brunet and S. Lennart Johnsson. 1991. All-to-All Broadcast and Applications on the Connection Machine.''
An all-to-all broadcast routing algorithm that allows concurrent communication on all channels of the Connection Machine Boolean cube network is described. Explicit routing formulas are given for both the physical broadcast between processors, and the virtual broadcast within processors. Implementation issues are addressed and timings for the physical and virtual broadcast are given for the Connection Machine system CM-2. The peak data transfer rate for the physical broadcast on a 64k CM-2 is 4.1 Gbytes/sec, and the peak rate for the virtual broadcast is about 20 Gbytes/sec. Reshaping of arrays is shown experimentally to reduce the broadcast time by a factor of up to 7 by reducing the amount of local data motion. Finally, we also show how to exploit symmetry for computation of an interaction matrix using the all-to-all broadcast function. Further optimizations are suggested for \ $N-$body type calculations. Using the all-to-all broadcast function, a peak rate of 5 Gflops/s has been achieved for the \ $N-$body computations in 32-bit precision on a 64k CM-2.
TR-22-91
Corey Kosak, Joe Marks, and Stuart Shieber. 1991. New Approaches to Automating Network-Diagram Layout.''
Network diagrams are a familiar graphic form that can express many different kinds of information. The problem of automating network -diagram layout has therefore received much attention. Previous research on network-diagram layout has focused on the problem of aesthetically optimal layout, using such criteria as the number of link crossings, the sum of all link lengths, and total diagram area. In this paper we propose a restatement of the network- diagram-layout problem in which layout-aesthetic concerns are subordinated to perceptual-organization concerns. We describe a notation for describing the visual organization of a network diagram. This notation is used in reformulating the layout task as a constrained-optimization problem in which constraints are derived from a visual-organization specification and optimality criteria are derived from layout-aesthetic considerations. Two new heuristic algorithms are presented for this version of the layout problem: one algorithm uses a rule-based strategy for computing a layout; the other is a massively parallel genetic algorithm. We demonstrate the capabilities of the two algorithms by testing them on a variety of network-diagram-layout problems.
TR-23-91 [tr-23-91.ps.gz (144 K)]
S. Lennart Johnsson. 1991. Minimizing the Communication Time for Matrix Multiplication on Multi-Processors.''
We present a few algorithms that allow concurrency in communication on multiple channels of multi-processors to be exploited for the multiplication of matrices of arbitrary shapes. For multi-processors configured as \ $n-$dimensional Boolean cubes our algorithms offer a speedup of the communication over previous algorithms for square matrices and square cubes by a factor of \ $\frac {n} {2} .$ \ We show that configuring \ $N$ \ processors as a three-dimensional array may reduce the communication complexity by a factor of \ $\sqrt[6] {N}$ \ compared to the two-dimensional partitioning. The best two- dimensional configuration of the multi-processor nodes has a ratio between the number of rows and columns equal to the ratio between the number of rows and columns of the product matrix. The optimum three dimensional configuration has a ratio between the length of the machine axes equal to the ratio between the length of the three axis in matrix multiplication. For product matrices of extreme shape a one-dimensional partitioning may be optimum. All presented algorithms use standard communication functions.
TR-24-91 [tr-24-91.ps.gz (93 K)]
S. Lennart Johnsson and Robert L. Krawitz. 1991. Cooley-Tukey FFT on the Connection Machine.''
We describe an implementation of the Cooley Tukey complex-to- complex FFT on the Connection Machine. The implementation is designed to make effective use of the communications bandwidth of the architecture, its memory bandwidth, and storage with precomputed twiddle factors. The peak data motion rate that is achieved for the interprocessor communication stages is in excess of 7 Gbytes/s for a Connection Machine system CM-200 with 2048 floating-point processors. The peak rate of FFT computations local to a processor is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision. The same FFT routine is used to perform both one- and multi-dimensional FFT without any explicit data rearrangement. The peak performance for one-dimensional FFT on data distributed over all processors is 5.4 Gflops/s in 32-bit precision and 3.2 Gflops/s in 64-bit precision. The peak performance for square, two-dimensional transforms, is 3.1 Gflops/s in 32-bit precision, and for cubic, three dimensional transforms, the peak is 2.0 Gflops/s in 64-bit precision. Certain oblong shapes yield better performance. The number of twiddle factors stored in each processor is \ $\frac {P} {2N} + log_ {2} N$ \ for an FFT on \ $P$ \ complex points uniformly distributed among \ $N$ \ processors. To achieve this level of storage efficiency we show that a decimation-in-time FFT is required for normal order input, and a decimation-in-frequency FFT is required for bit-reversed input order.
TR-25-91 [tr-25-91.ps.gz (114 K)]
S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. 1991. Communication Efficient Multi-processor FFT.''
Computing the Fast Fourier Transform on a distributed memory architecture by a direct pipelined radix-2 algorithm, a bi-section or multi-section algorithm, all yield the same communications requirement, if communication for all FFT stages can be performed concurrently, the input date is in normal order, and the data allocation consecutive. With a cyclic data allocation, or bit- reversed input data and a consecutive allocation, multi-sectioning offers a reduced communications requirement by approximately a factor of two. For a consecutive data allocation, normal input order, a decimation-in-time FFT requires that \ $\frac {P} {N} + d - 2$ \ twiddle factors be stored for \ $P$ \ elements distributed evenly over \ $N$ \ processors and the axis subject to transformation distributed over \ $2^ {d}$ \ processors. No communication of twiddle factors is required. The same storage requirements hold for a decimation-in-frequency FFT, bit-reversed input order, and consecutive data allocation. The opposite combination of FFT type and data ordering requires a factor of \ $log_ {2} N$ \ more storage for \ $N$ \ processors. The peak performance for a Connection Machine system CM-200 implementation is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision for unordered transforms local to each processor. The corresponding execution rates for ordered transforms are 11.1. Gflops/s and 8.5 Gflops/s, respectively. For distributed one- and two-dimensional transforms the peak performance for unordered transforms exceeds 5 Gflops/s in 32-bit precision, and 3 Gflops/s in 64-bit precision. Three-dimensional transforms executes at a slightly lower rate. Distributed ordered transforms executes at a rate of about \ $\frac {1} {2}$ \ to \ $\frac {2} {3}$ \ of the unordered transforms.
TR-26-91
Thomas R. Hancock, Mostefa Golea, and Mario Marchand. 1991. Learning Nonoverlapping Perceptron Networks From Examples and Membership Queries.''
We investigate, within the PAC learning model, the problem of learning nonoverlapping perceptron networks. These are loop-free neural nets in which each node has only one outgoing weight. We give a polynomial time algorithm that PAC learns any nonoverlapping perceptron network using examples and membership queries. The algorithm is able to identify both the architecture and the weight values necessary to represent the function to be learned. Our results shed some light on the effect of the overlap on the complexity of learning in neural networks.
TR-27-91
Jon Christensen, Joe Marks, and Stuart Shieber. 1991. Labeling Point Features on Maps and Diagrams Using Simulated Annealing.''
A major factor affecting the clarity of graphical displays is the degree to which text labels obscure display features (including other labels) as a result of spatial overlap. Point-feature label placement (PFLP) is the problem of placing text labels adjacent to point features on a map or diagram in order to maximize legibility. This problem arises for all kinds of informational graphics, though it is most often associated with automated cartography. In this paper we present a comprehensive treatment of the PFLP problem. First, we summarize some recent results regarding the computational complexity of PFLP. These results show that optimal PFLP is NP-hard. Second, we survey previously reported algorithms for PFLP. Third, we describe a stochastic- optimization method for PFLP, based on simulated annealing. Finally, we present the results of an empirical comparison of the known algorithms for PFLP. Our results indicate that the simulated-annealing approach to PFLP is superior to all existing methods, regardless of label density.
TR-28-91
Marios Mavronicolas and Dan Roth. 1991. Linearizable Read/Write Objects.''
We study the cost of implementing {\em linearizable} read/write objects for shared-memory multiprocessors and under various assumptions on the available timing information. We take as cost measure the {\em worst-case response time} of performing an operation in distributed implementations of virtual shared memory consisting of such objects and supporting linearizability. It is assumed that processes have clocks that run at the same rate as real time and all messages incur a delay in the range $[d-u,d]$ for some known constants $u$ and $d$, $0 leq u leq d$. In the {\em perfect clocks} model, where processes have perfectly synchronized clocks and every message incurs a delay of exactly $d$, we present a family of optimal linearizable implementations, parameterized by a constant $eta$, $0 leq eta leq 1$, for which the worst-case response times for read and write operations are $eta d$ and $(1-eta)d$, respectively. The parameter $eta$ may be appropriately chosen to account for the relative frequencies of read and write operations. Our main result is the first known linearizable implementation for the {\em imperfect clocks} model where clocks are not initially synchronized and message delays can vary, i.e., $u > 0$; it achieves worst-case response times of less than $4u+b$ ($b>0$ is an arbitrarily small constant) and $d+3u$ for read and write operations, respectively. This implementation uses novel synchronization techniques to exploit the lower bound on message delay time and achieve bounds on worst-case response times that depend on the message delay uncertainty $u$. For a wide range of values of $u$, these bounds improve previously known ones for implementations that support consistency conditions even weaker than linearizability.