Harvard Computer Science Technical Reports for 1995

TR-01-95 [tr-01-95.ps.gz (57 K)]
Stanley F. Chen. 1995. ``Bayesian Grammar Induction for Language Modeling.''
We describe a corpus-based induction algorithm for probabilistic context-free grammars. The algorithm employs a greedy heuristic search within a Bayesian framework, and a post-pass using the Inside-Outside algorithm. We compare the performance of our algorithm to n-gram models and the Inside-Outside algorithm in three language modeling tasks. In two of these domains, our algorithm outperforms these other techniques, marking the first time a grammar-based language model has surpassed n-gram modeling in a task of at least moderate size.
TR-02-95 [tr-02-95.ps.gz (314 K)]
Dan Roth. 1995. ``Learning in Order to Reason.''
Any theory aimed at understanding {\em commonsense} reasoning, the process that humans use to cope with the mundane but complex aspects of the world in evaluating everyday situations, should account for its flexibility, its adaptability, and the speed with which it is performed.

In this thesis we analyze current theories of reasoning and argue that they do not satisfy those requirements. We then proceed to develop a new framework for the study of reasoning, in which a learning component has a principal role. We show that our framework efficiently supports a lot ``more reasoning'' than traditional approaches and at the same time matches our expectations of plausible patterns of reasoning in cases where other theories do not.

In the first part of this thesis we present a computational study of the knowledge-based system approach, the generally accepted framework for reasoning in intelligent systems. We present a comprehensive study of several methods used in approximate reasoning as well as some reasoning techniques that use approximations in an effort to avoid computational difficulties. We show that these are even harder computationally than exact reasoning tasks. What is more surprising is that, as we show, even the approximate versions of these approximate reasoning tasks are intractable, and these severe hardness results on approximate reasoning hold even for very restricted knowledge representations.

Motivated by these computational considerations we argue that a central question to consider, if we want to develop computational models for commonsense reasoning, is how the intelligent system acquires its knowledge and how this process of interaction with its environment influences the performance of the reasoning system. The {\em Learning to Reason} framework developed and studied in the rest of the thesis exhibits the role of inductive learning in achieving efficient reasoning, and the importance of studying reasoning and learning phenomena together. The framework is defined in a way that is intended to overcome the main computational difficulties in the traditional treatment of reasoning, and indeed, we exhibit several positive results that do not hold in the traditional setting. We develop Learning to Reason algorithms for classes of theories for which no efficient reasoning algorithm exists when represented as a traditional (formula-based) knowledge base. We also exhibit Learning to Reason algorithms for a class of theories that is not known to be learnable in the traditional sense. Many of our results rely on the theory of model-based representations that we develop in this thesis. In this representation, the knowledge base is represented as a set of models (satisfying assignments) rather than a logical formula. We show that in many cases reasoning with a model-based representation is more efficient than reasoning with a formula-based representation and, more significantly, that it suggests a new view of reasoning, and in particular, of logical reasoning.

In the final part of this thesis, we address another fundamental criticism of the knowledge-based system approach. We suggest a new approach for the study of the non-monotonicity of human commonsense reasoning, within the Learning to Reason framework. The theory developed is shown to support efficient reasoning with incomplete information, and to avoid many of the representational problems which existing default reasoning formalisms face.

We show how the various reasoning tasks we discuss in this thesis relate to each other and conclude that they are all supported together naturally.

TR-03-95 [tr-03-95.ps.gz (81 K)]
Roni Khardon. 1995. ``Translating between Horn Representations and their Characteristic Models.''
Characteristic models are an alternative, model-based, representation for Horn expressions. It has been shown that these two representations are incomparable and each has its advantages over the other. It is therefore natural to ask what is the cost of translating, back and forth, between these representations. Interestingly, the same translation questions arise in database theory, where they have applications to the design of relational databases.

We study the complexity of these problems and prove some positive and negative results. Our main result is that the two translation problems are equivalent under polynomial reductions, and that they are equivalent to the corresponding decision problem. Namely, translating is equivalent to deciding whether a given set of models is the set of characteristic models for a given Horn expression.

We also relate these problems to translating between the CNF and DNF representations of monotone functions, a well known problem for which no polynomial time algorithm is known. It is shown that in general our translation problems are at least as hard as the latter, and in a special case they are equivalent to it.
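
As a small worked illustration of the monotone CNF/DNF translation problem referred to above (our own example, not taken from the report): distributing terms over clauses gives
\[
  x_1 x_2 \lor x_3 \;\equiv\; (x_1 \lor x_3)(x_2 \lor x_3),
\]
and the output can be exponentially larger than the input, since the monotone DNF $\bigvee_{i=1}^{n} (x_{2i-1} \land x_{2i})$ has $2^n$ prime clauses, one for each way of choosing a single variable from every term.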

TR-04-95
Yan-Zhong Ding and Dan Stefanescu. 1995. ``Volume of a Hyper-Parallelepiped after Affine Transformations, and its Applications to Optimal Parallel Loop Execution.''
THIS REPORT HAS BEEN WITHDRAWN
TR-05-95 [tr-05-95.ps.gz (56 K)]
Robert L. Walton. 1995. ``A Proposed New Memory Manager.''
Memory managers should support compactification, multiple simultaneous garbage collections, and ephemeral collections in a realtime multi-processor shared memory environment. They should permit old addresses of an object to be invalidated without significant delay, and should permit array accesses with no per-element inefficiency.

A new approach to building an optimal standard solution to these requirements is presented for stock hardware and next generation languages. If such an approach should become a standard, this would spur the development of standard hardware to optimize away the overhead.

TR-06-95 [tr-06-95.ps.gz (167 K)]
Cliff Young, Nicolas Gloy, and Michael D. Smith. 1995. ``A Comparative Analysis of Schemes for Correlated Branch Prediction.''
Modern high-performance architectures require extremely accurate branch prediction to overcome the performance limitations of conditional branches. We present a framework that categorizes branch prediction schemes by the way in which they partition dynamic branches and by the kind of predictor that they use. The framework allows us to compare and contrast branch prediction schemes, and to analyze why they work. We use the framework to show how a static correlated branch prediction scheme increases branch bias and thus improves overall branch prediction accuracy. We also use the framework to identify the fundamental differences between static and dynamic correlated branch prediction schemes. This study shows that there is room to improve the prediction accuracy of existing branch prediction schemes.
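
To make the partition/predictor view concrete, the sketch below (an illustration of a generic correlated scheme, not the specific schemes evaluated in the report) partitions dynamic branches by the pair (branch address, recent global history) and attaches a 2-bit saturating counter to each partition; the class name and parameters are our own.

    # Illustrative sketch of a correlated branch predictor: dynamic branches are
    # partitioned by (PC, global history) and each partition gets its own 2-bit
    # saturating counter. Not the report's exact schemes.
    class CorrelatedPredictor:
        def __init__(self, history_bits=4):
            self.history_bits = history_bits
            self.history = 0                       # last few outcomes, one bit each
            self.counters = {}                     # (pc, history) -> counter in [0, 3]

        def predict(self, pc):
            return self.counters.get((pc, self.history), 1) >= 2

        def update(self, pc, taken):
            key = (pc, self.history)
            c = self.counters.get(key, 1)
            self.counters[key] = min(3, c + 1) if taken else max(0, c - 1)
            mask = (1 << self.history_bits) - 1
            self.history = ((self.history << 1) | int(taken)) & mask

    # A branch whose outcome alternates is poorly predicted per-branch, but is
    # strongly biased within each (PC, history) partition.
    p, correct, outcome = CorrelatedPredictor(), 0, True
    for _ in range(1000):
        outcome = not outcome
        correct += (p.predict(0x400) == outcome)
        p.update(0x400, outcome)
    print("accuracy:", correct / 1000)
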
TR-07-95 [tr-07-95.ps.gz (86 K)]
Amr Fahmy and Robert Roos. 1995. ``Efficient Learning of Real Time One-Counter Automata.''
We present an efficient learning algorithm for languages accepted by deterministic real time one counter automata (ROCA). The learning algorithm works by first learning an initial segment, $B_n$, of the infinite state machine that accepts the unknown language and then decomposing it into a complete control structure and a partial counter. A new efficient ROCA decomposition algorithm, which will be presented in detail, allows this result. The decomposition algorithm works in $O(n^2 \log(n))$ where $nc$ is the number of states of $B_n$ for some constant $c$.

If Angluin's algorithm for learning regular languages is used to learn $B_n$, and the complexity of this step is $h(n,m)$, where $m$ is the length of the longest counterexample necessary for Angluin's algorithm, then the complexity of our algorithm is $O(h(n,m) + n^2 \log(n))$.

TR-08-95 [tr-08-95.ps.gz (104 K)]
Ted Nesson and S. Lennart Johnsson. 1995. ``ROMM Routing on Mesh and Torus Networks.''
ROMM is a class of Randomized, Oblivious, Multi--phase, Minimal routing algorithms. ROMM routing offers a potential for improved performance compared to both fully randomized algorithms and deterministic oblivious algorithms, under both light and heavy loads. ROMM routing also offers close to best case performance for many common routing problems. In previous work, these claims were supported by extensive simulations on binary cube networks. Here we present analytical and empirical results for ROMM routing on wormhole routed mesh and torus networks. Our simulations show that ROMM algorithms can perform several representative routing tasks 1.5 to 3 times faster than fully randomized algorithms, for medium--sized networks. Furthermore, ROMM algorithms are always competitive with deterministic, oblivious routing, and in some cases, up to 2 times faster.
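
As a hedged illustration of the two-phase idea (our sketch, not the authors' implementation), the following routes a packet on a 2-D mesh by picking a random intermediate node inside the minimal bounding box of source and destination and routing dimension order to it and then to the destination; every hop stays inside the box, so the route remains minimal.

    import random

    def dimension_order(a, b):
        """Minimal x-then-y path from a to b on a mesh (hops, excluding a)."""
        (ax, ay), (bx, by) = a, b
        path, x, y = [], ax, ay
        while x != bx:
            x += 1 if bx > x else -1
            path.append((x, y))
        while y != by:
            y += 1 if by > y else -1
            path.append((x, y))
        return path

    def romm_two_phase(src, dst):
        """Two-phase ROMM-style route: random intermediate node in the minimal box."""
        (sx, sy), (dx, dy) = src, dst
        mid = (random.randint(min(sx, dx), max(sx, dx)),
               random.randint(min(sy, dy), max(sy, dy)))
        return dimension_order(src, mid) + dimension_order(mid, dst)

    # Every route from (0, 0) to (3, 2) has exactly 3 + 2 = 5 hops, but the
    # random intermediate node spreads traffic over the minimal box.
    print(romm_two_phase((0, 0), (3, 2)))
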
TR-09-95 [tr-09-95.ps.gz (77 K)]
J. Bradley Chen, Yasuhiro Endo, Kee Chan, David Mazieres, Antonio Dias, Margo Seltzer, and Michael Smith. 1995. ``The Impact of Operating System Structure on Personal Computer Performance.''
This paper presents a comparative study of the performance of three operating systems that run on the personal computer architecture derived from the IBM-PC. The operating systems, Windows for Workgroups (tm), Windows NT (tm), and NetBSD (a freely available UNIX (tm) variant) cover a broad range of system functionality and user requirements, from a single address space model to full protection with preemptive multitasking. Our measurements were enabled by hardware counters in Intel's Pentium (tm) processor that permit measurement of a broad range of processor events including instruction counts and on-chip cache miss rates. We used both microbenchmarks, which expose specific differences between the systems, and application workloads, which provide an indication of expected end-to-end performance. Our microbenchmark results show that accessing system functionality is more expensive in Windows than in the other two systems due to frequent changes in machine mode and the use of system call hooks. When running native applications, Windows NT is more efficient than Windows, but it does incur overhead from its microkernel structure. Overall, system functionality can be accessed most efficiently in NetBSD; we attribute this to its monolithic structure, and to the absence of the complications created by backwards compatibility in the other systems. Measurements of application performance show that the impact of these differences is significant in terms of overall execution time.
TR-10-95 [tr-10-95.ps.gz (959 K), tr-10-95-p70.ps.gz (407 K), tr-10-95-p71.ps.gz (394 K)]
Jon Christensen. 1995. ``Managing Design Complexity: Using Stochastic Optimization in the Production of Computer Graphics.''
This thesis examines the automated design of computer graphics. We present a methodology that emphasizes optimization, problem representation, stochastic search, and empirical analysis. Two problems are considered, which together encompass and exemplify both 2D and 3D graphics production: label placement and motion synthesis for animation.

Label placement is the problem of annotating various informational graphics with textual labels, subject to constraints that respect proper label-feature associativity, label and feature obscuration, and aesthetically desirable label positions. Examples of label placement include applying textual labels to a geographic map, or item tags to a scatterplot. Motion synthesis is the problem of composing a visually plausible motion for an animated character, subject to animator-imposed constraints on the form and characteristics of the desired motion.

For each problem we propose new solution methods that utilize efficient problem representations combined with stochastic optimization techniques. We demonstrate that these methods offer significant advantages over competing solutions in terms of ease-of-use, visual quality, and computational efficiency. Taken together, these results also demonstrate an effective approach for continued progress in automating graphical design, which should be applicable to a wide range of graphical design applications beyond the two considered here.
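
The label-placement side of this methodology can be sketched with a generic stochastic optimizer. The toy below (our illustration; the candidate slots, cost function, and annealing schedule are assumptions, not the thesis's formulation) assigns each point feature one of four label positions and uses simulated annealing to reduce label-label overlaps.

    import math
    import random

    OFFSETS = [(1, 1), (-1, 1), (1, -1), (-1, -1)]   # NE, NW, SE, SW candidate slots
    W, H = 4.0, 1.0                                  # assumed label width and height

    def label_box(point, slot):
        px, py = point
        dx, dy = OFFSETS[slot]
        x = px if dx > 0 else px - W
        y = py if dy > 0 else py - H
        return (x, y, x + W, y + H)

    def overlaps(a, b):
        return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

    def cost(points, slots):
        boxes = [label_box(p, s) for p, s in zip(points, slots)]
        return sum(overlaps(boxes[i], boxes[j])
                   for i in range(len(boxes)) for j in range(i + 1, len(boxes)))

    def anneal(points, steps=20000, t0=2.0):
        slots = [random.randrange(4) for _ in points]
        c = cost(points, slots)
        for step in range(steps):
            t = t0 * (1.0 - step / steps) + 1e-9          # cooling schedule
            i, new = random.randrange(len(points)), random.randrange(4)
            old, slots[i] = slots[i], new
            c_new = cost(points, slots)
            if c_new <= c or random.random() < math.exp((c - c_new) / t):
                c = c_new                                 # accept the move
            else:
                slots[i] = old                            # reject and restore
        return slots, c

    points = [(random.uniform(0, 20), random.uniform(0, 20)) for _ in range(30)]
    slots, conflicts = anneal(points)
    print("remaining label-label conflicts:", conflicts)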

TR-11-95 [tr-11-95.ps.gz (299 K)]
Andrew Kehler. 1995. ``Interpreting Cohesive Forms in the Context of Discourse Inference.''
In this thesis, we present analyses and algorithms for resolving a variety of cohesive phenomena in natural language, including VP-ellipsis, gapping, event reference, tense, and pronominal reference. Past work has attempted to explain the complicated behavior of these expressions with theories that operate within a single module of language processing. We argue that such approaches cannot be maintained; in particular, the data we present strongly suggest that the nature of the coherence relation operative between clauses needs to be taken into account.

We provide a theory of coherence relations and the discourse inference processes that underlie their recognition. We utilize this theory to break the deadlock between syntactic and semantic approaches to resolving VP-ellipsis. We show that the data exhibits a pattern with respect to our categorization of coherence relations, and present an account which predicts this pattern. We extend our analysis to gapping and event reference, and show that our analyses result in a more independently-motivated and empirically-adequate distinction among types of anaphoric processes than past analyses.

We also present an account of VP-ellipsis resolution that predicts the correct set of `strict' and `sloppy' readings for a number of benchmark examples that are problematic for past approaches. The correct readings can be seen to result from a general distinction between `referring' and `copying' in anaphoric processes. The account also extends to other types of reference, such as event reference and `one'-anaphora.

Finally, we utilize our theory of coherence in analyses that break the deadlock between definite-reference and coherence-based approaches to tense and pronoun interpretation. We present a theory of tense interpretation that interacts with discourse inference processes to predict data that is problematic for both types of approach. We demonstrate that the data commonly cited in the pronoun interpretation literature also exhibits a pattern with respect to coherence relations, and make some preliminary proposals for how such a pattern might result from the properties of the different types of discourse inference we posit.

TR-12-95 [tr-12-95.ps.gz (852 K)]
Karen McIntosh Daniels. 1995. ``Containment Algorithms for Nonconvex Polygons with Applications to Layout.''
Layout and packing are NP-hard geometric optimization problems which appear in a variety of manufacturing industries. At their core, layout and packing problems have the common geometric feasibility problem of {\em containment}: find a way of placing a set of items into a container. We focus on containment and its applications to layout and packing problems. We demonstrate that, although containment is NP-hard, it is fruitful to: 1) develop algorithms for containment, as opposed to heuristics, 2) design containment algorithms so that they say ``no'' almost as fast as they say ``yes'', 3) use geometric techniques, not just mathematical programming techniques, and 4) maximize the number of items for which the algorithms are practical.

Our approach to containment is based on a new restrict/evaluate/subdivide paradigm. We develop theory and practical techniques for the operations within the paradigm. The techniques are appropriate for two-dimensional containment problems in which the items and container may be irregular polygons, and in which the items may be translated, but not rotated. Our techniques can be combined to form a variety of two-dimensional translational containment algorithms. The paradigm is designed so that, unlike existing iteration-based algorithms, containment algorithms based on the paradigm are adept at saying ``no'', even for slightly infeasible problems. We present two algorithms based on our paradigm. We obtain the first practical running times for NP-complete two-dimensional translational containment problems for up to ten nonconvex items in a nonconvex container.

We demonstrate that viewing containment as a feasibility problem has many benefits for packing and layout problems. For example, we present an effective method for finding minimal enclosures which uses containment to perform binary search on a parameter. Compaction techniques can accelerate the search. We also use containment to develop the first practical pre-packing strategy for a multi-stage pattern layout problem in apparel manufacturing. Pre-packing is a layout method which packs items into a collection of containers by first generating groups of items which fit into each container and then assigning groups to containers.

TR-13-95 [tr-13-95.ps.gz (95 K)]
J. Bradley Chen. 1995. ``Probabilistic Cache Replacement.''
Modern microprocessors tend to use on-chip caches that are much smaller than the working set size of many interesting computations. In such situations, cache performance can be improved through selective caching, the use of cache replacement policies in which data fetched from memory, although forwarded to the CPU, is not necessarily loaded into the cache. This paper introduces a selective caching policy called Probabilistic Cache Replacement (PCR), in which caching of data fetched from main memory is determined by a probabilistic boolean-valued function. Use of PCR creates a self-selection mechanism in which repeated misses to a word in memory increase its probability of being loaded into the cache. A PCR cache gives better reductions in instruction cache miss rate than a comparable cache configuration with a victim cache. Instruction cache miss rates can be reduced by up to 30% for some of the SPECmarks, although the optimal probability distribution is workload dependent. This paper also presents a mechanism called Feedback PCR which dynamically selects probability values for a PCR cache. For a 16 Kbyte direct-mapped instruction cache, Feedback PCR with a one-entry MFB gives an average reduction in cache misses of over 11% across the SPECmarks with no significant increase in cache misses for any of the workloads, and compares favorably with other alternatives of similar hardware cost.
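
A minimal sketch of the selective-caching idea (our illustration with a fixed installation probability; the report's PCR and Feedback PCR policies are more refined): on a miss, the fetched line is forwarded to the CPU but installed in the cache only with probability p, so lines that miss repeatedly accumulate chances to become resident while one-touch lines tend not to displace useful data.

    import random

    class PCRCache:
        """Direct-mapped cache with probabilistic installation on miss."""
        def __init__(self, num_sets, p_install=0.25):
            self.tags = [None] * num_sets
            self.num_sets = num_sets
            self.p_install = p_install
            self.misses = 0

        def access(self, addr):
            s, tag = addr % self.num_sets, addr // self.num_sets
            if self.tags[s] == tag:
                return True                        # hit
            self.misses += 1                       # data still goes to the CPU
            if random.random() < self.p_install:   # probabilistic replacement decision
                self.tags[s] = tag
            return False

    # A hot line conflicts with a stream of one-touch lines mapped to the same
    # set; probabilistic installation keeps the hot line resident more often
    # than an always-install policy would.
    cache, hot = PCRCache(num_sets=64, p_install=0.25), 7
    for i in range(10000):
        cache.access(hot)
        cache.access(64 * (i + 1) + 7)             # one-touch conflicting line
    print("misses:", cache.misses)
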
TR-14-95 [tr-14-95.ps.gz (55 K)]
J. Bradley Chen, H.T. Kung, and Margo Seltzer. 1995. ``MOSS: A Mobile Operating Systems Substrate.''
The Mobile Operating System Substrate (MOSS) is a new system architecture for wireless mobile computing being developed at Harvard. MOSS provides highly efficient, robust and flexible virtual device access over wireless media. MOSS services provide mobile access to such resources as disks, CD ROM drives, displays, wired network interfaces, and audio and video devices. MOSS services are composed of virtual circuits and virtual devices. Virtual circuits (VCs) on wireless media support the spectrum of quality-of-service (QoS) levels required to cover a broad range of application requirements. Virtual devices implement resource access using VCs as their communication substrate. The tight coupling of network code and device implementations makes it possible to apply device-specific semantics to communications resource management problems. MOSS will enable mobile software systems to adapt dynamically to the rapidly changing computing and communications environment created by mobility.
TR-15-95 [tr-15-95.ps.gz (90 K)]
Roni Khardon, Heikki Mannila, and Dan Roth. 1995. ``Reasoning with Examples: Propositional Formulae and Database Dependencies.''
For humans, looking at how concrete examples behave is an intuitive way of deriving conclusions. The drawback with this method is that it does not necessarily give the correct results. However, under certain conditions example-based deduction can be used to obtain a correct and complete inference procedure. This is the case for Boolean formulae (reasoning with models) and for certain types of database integrity constraints (the use of Armstrong relations). We show that these approaches are closely related, and use the relationship to prove new results about the existence and sizes of Armstrong relations for Boolean dependencies. Further, we study the problem of translating between different representations of relational databases, in particular we consider Armstrong relations and Boolean dependencies, and prove some positive results in that context. Finally, we discuss the close relations between the questions of finding keys in relational databases and that of finding abductive explanations.
TR-16-95 [tr-16-95.ps.gz (44 K)]
Margo Seltzer, Keith Smith, and Christopher Small. 1995. ``The Case for Extensible Operating Systems.''
Many of the performance improvements cited in recent operating systems research describe specific enhancements to normal operating system functionality that improve performance in a set of designated test cases. Global changes of this sort can improve performance for one application, at the cost of decreasing performance for others. We argue that this flurry of global kernel tweaking is an indication that our current operating system model is inappropriate. Existing interfaces do not provide the flexibility to tune the kernel on a per-application basis, to suit the variety of applications that we now see.

We have failed in the past to be omniscient about future operating system requirements; there is no reason to believe that we will fare any better designing a new, fixed kernel interface today. Instead, the only general-purpose solution is to build an operating system interface that is easily extendable. We present a kernel framework designed to support the application-specific customization that is beginning to dominate the operating system literature. We show how this model enables easy implementation of many of the earlier research results. We then analyze two specific kernel policies: page read-ahead and lock-granting. We show that application-control over read-ahead policy produces performance improvements of up to 16%. We then show how application-control over the lock-granting policy can choose between fairness and response time. Reader priority algorithms produce lower read response time at the cost of writer starvation. FIFO algorithms avoid the starvation problem, but increase read response time.
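
The lock-granting trade-off can be made concrete with two small policy functions (our sketch; the waiter queue and interface are invented for illustration, not the kernel framework's API). Each policy decides which waiters to admit when the lock is released.

    from collections import deque

    def grant_reader_priority(waiters):
        """Admit every waiting reader; a writer runs only when no readers wait."""
        readers = [w for w in waiters if w[1] == "r"]
        if readers:
            for w in readers:
                waiters.remove(w)
            return readers                         # all readers share the lock
        return [waiters.popleft()] if waiters else []

    def grant_fifo(waiters):
        """Admit the oldest waiter; readers queued right behind it may join it."""
        if not waiters:
            return []
        granted = [waiters.popleft()]
        if granted[0][1] == "r":
            while waiters and waiters[0][1] == "r":
                granted.append(waiters.popleft())
        return granted

    # With a steady stream of readers, reader priority keeps deferring the
    # writer (starvation risk), while FIFO admits it right after the readers
    # that arrived before it.
    waiters = deque([("r1", "r"), ("w1", "w"), ("r2", "r"), ("r3", "r")])
    print(grant_reader_priority(deque(waiters)))   # all three readers
    print(grant_fifo(deque(waiters)))              # only r1; w1 is next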

TR-17-95 [tr-17-95.ps.gz (247 K)]
James Gwertzman. 1995. ``Autonomous Replication in Wide-Area Internetworks.''
The number of users connected to the Internet has been growing at an exponential rate, resulting in similar increases in network traffic and Internet server load. Advances in microprocessors and network technologies have kept up with growth so far, but we are reaching the limits of hardware solutions. In order for the Internet's growth to continue, we must efficiently distribute server load and reduce the network traffic generated by its various services.

Traditional wide-area caching schemes are client initiated. Decisions on where and when to cache information are made without the benefit of the server's global knowledge of the situation. We introduce a technique, push-caching, that is server initiated; it leaves caching decisions to the server. The server uses its knowledge of network topology, geography, and access patterns to minimize network traffic and server load.

The World Wide Web is an example of a large-scale distributed information system that will benefit from this geographical distribution, and we present an architecture that allows a Web server to autonomously replicate Web files. We use a trace-driven simulation of the Internet to evaluate several competing caching strategies. Our results show that while simple client caching reduces server load and network bandwidth demands by up to 30%, adding server-initiated caching reduces server load by an additional 20% and network bandwidth demands by an additional 10%. Furthermore, push-caching is more efficient than client-caching, using an order of magnitude less cache space for comparable bandwidth and load savings.

To determine the optimal cache consistency protocol we used a generic server simulator to evaluate several cache-consistency protocols, and found that weak consistency protocols are sufficient for the World Wide Web since they use the same bandwidth as an atomic protocol, impose less server load, and return stale data less than 1% of the time.

TR-18-95 [tr-18-95.ps.gz (94 K)]
Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. 1995. ``Centering: A Framework for Modelling the Local Coherence of Discourse.''

TR-19-95 [tr-19-95.ps.gz (541 K)]
Diane L. Tang. 1995. ``Benchmarking Filesystems.''
One of the most widely researched areas in operating systems is filesystem design, implementation, and performance. Almost all of the research involves reporting performance numbers gathered from a variety of different benchmarks. The problem with such results is that existing filesystem benchmarks are inadequate, suffering from problems ranging from not scaling with advancing technology to not measuring the filesystem.

A new approach to filesystem benchmarking is presented here. This methodology is designed both to help system designers understand and improve existing systems and to help users decide which filesystem to buy or run. For usability, the benchmark is separated into two parts: a suite of micro-benchmarks, which is actually run on the filesystem, and a workload characterizer. The results from the two separate parts can be combined to predict the performance of the filesystem on the workload.

The purpose for this separation of functionality is two-fold. First, many system designers would like their filesystem to perform well under diverse workloads: by characterizing the workload independently, the designers can better understand what is required of the filesystem. The micro-benchmarks tell the designer what needs to be improved while the workload characterizer tells the designer whether that improvement will affect filesystem performance under that workload. This separation also helps users trying to decide which system to run or buy, who may not be able to run their workload on all systems under consideration, and therefore need this separation.
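
The combine step can be illustrated in a few lines (the operation categories and numbers below are invented for illustration; the benchmark's actual prediction algorithm is more involved): per-operation costs measured by the micro-benchmarks are weighted by the operation mix reported by the workload characterizer.

    # Hypothetical micro-benchmark results for one filesystem (seconds per op).
    micro_benchmarks = {
        "small_read": 0.0004,
        "small_write": 0.0011,
        "create": 0.0030,
        "delete": 0.0025,
    }

    # Hypothetical operation counts produced by the workload characterizer.
    workload_profile = {
        "small_read": 120000,
        "small_write": 45000,
        "create": 3000,
        "delete": 2800,
    }

    predicted = sum(workload_profile[op] * micro_benchmarks[op]
                    for op in workload_profile)
    print(f"predicted workload time: {predicted:.1f} s")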

The implementation of this methodology does not suffer from many of the problems seen in existing benchmarks: it scales with technology, it is tightly specified, and it helps system designers. The benchmark's only drawback is that it does not yet accurately predict the performance of a filesystem on a workload, which limits its applicability: it is useful to system designers, but not to users trying to decide which system to buy. The belief is that the general approach will work, given additional time to refine the prediction algorithm.

TR-20-95 [tr-20-95.ps.gz (248 K)]
Barbara J. Grosz and Sarit Kraus. 1995. ``Collaborative Plans for Complex Group Action.''
The original formulation of SharedPlans was developed to provide a model of collaborative planning in which it was not necessary for one agent to have intentions-to toward an act of a different agent. Unlike other contemporaneous approaches, this formulation provided for two agents to coordinate their activities without introducing any notion of irreducible joint intentions. However, it only treated activities that directly decomposed into single-agent actions, did not address the need for agents to commit to their joint activity, and did not adequately deal with agents having only partial knowledge of the way in which to perform an action. This paper provides a revised and expanded version of SharedPlans that addresses these shortcomings. It also reformulates Pollack's definition of individual plans to handle cases in which a single agent has only partial knowledge; this reformulation meshes with the definition of SharedPlans. The new definitions also allow for contracting out certain actions. The formalization that results has the features required by Bratman's account of shared cooperative activity and is more general than alternative accounts.
TR-21-95 [tr-21-95.ps.gz (64 K)]
Christine H. Nakatani, Barbara J. Grosz, David D. Ahn, and Julia Hirschberg. 1995. ``Instructions for Annotating Discourse.''

TR-22-95 [tr-22-95.ps.gz (174 K)]
Karen Daniels, Victor J. Milenkovic, and Dan Roth. 1995. ``Finding the Largest Rectangle in Several Classes of Polygons.''
This paper considers the geometric optimization problem of finding the Largest area axis-parallel Rectangle (LR) in an $n$-vertex general polygon. We characterize the LR for general polygons by considering different cases based on the types of contacts between the rectangle and the polygon. A general framework is presented for solving a key subproblem of the LR problem which dominates the running time for a variety of polygon types. This framework permits us to transform an algorithm for orthogonal polygons into an algorithm for nonorthogonal polygons. Using this framework, we obtain the following LR time results: $\Theta(n)$ for $xy$-monotone polygons, ${\rm O}(n \alpha(n))$ for orthogonally convex polygons (where $\alpha(n)$ is the slowly growing inverse of Ackermann's function), ${\rm O}(n \alpha(n) \log n)$ for horizontally (vertically) convex polygons, ${\rm O}(n \log n)$ for a special type of horizontally convex polygon (whose boundary consists of two $y$-monotone chains on opposite sides of a vertical line), and ${\rm O}(n \log^2 n)$ for general polygons (allowing holes). For all these types of non-orthogonal polygons, we match the running time of the best known algorithms for their orthogonal counterparts. A lower bound of $\Omega(n \log n)$ time is established for finding the LR in both self-intersecting polygons and general polygons with holes. The latter result gives us both a lower bound of $\Omega(n \log n)$ and an upper bound of ${\rm O}(n \log^2 n)$ for general polygons.
TR-23-95 [tr-23-95.ps.gz (116 K)]
Nicolas Gloy, Michael D. Smith, and Cliff Young. 1995. ``Performance Issues in Correlated Branch Prediction Schemes.''
Accurate static branch prediction is the key to many techniques for exposing, enhancing, and exploiting Instruction Level Parallelism (ILP). The initial work on static correlated branch prediction (SCBP) demonstrated improvements in branch prediction accuracy, but did not address overall performance. In particular, SCBP expands the size of executable programs, which negatively affects the performance of the instruction memory hierarchy. Using the profile information available under SCBP, we can minimize these negative performance effects through the application of code layout and branch alignment techniques. We evaluate the performance effect of SCBP and these profile-driven optimizations on instruction cache misses, branch mispredictions, and branch misfetches for a number of recent processor implementations. We find that SCBP improves performance over (traditional) per-branch static profile prediction. We also find that SCBP improves the performance benefits gained from branch alignment. As expected, SCBP gives larger benefits on machine organizations with high mispredict/misfetch penalties and low cache miss penalties. Finally, we find that the application of profile-driven code layout and branch alignment techniques (without SCBP) can improve the performance of the dynamic correlated branch prediction techniques.
TR-24-95 [tr-24-95.ps.gz (480 K)]
Ted Nesson. 1995. ``Randomized, Oblivious, Minimal Routing Algorithms for Multicomputers.''
Efficient data motion has been critical in high performance computing for as long as computers have been in existence. Massively parallel computers use a sparse interconnection network between processing nodes with local memories. Minimizing the potential for high congestion of communication links is an important goal in the design of routing algorithms and interconnection networks in these systems.

In these distributed--memory architectures, the communication system represents a significant portion of the total system cost, but is nevertheless often a weak link in the system with respect to performance. Efficient interprocessor communication is one of the most important and most challenging problems associated with massively parallel computing. Communication delays can easily represent a large fraction of the total running time, inhibiting high performance computing for a wide range of problems. Efficient use of the communication system is the focus of this thesis.

The design of the interconnection network and the routing algorithms used to transport data between nodes are critical to overall system performance. The constraints imposed by a sparse interconnection network suggest that preserving locality of reference through careful data allocation and minimizing network load by using minimal algorithms are desirable objectives.

In this thesis, we present ROMM, a new class of general--purpose message routing algorithms for large--scale, distributed--memory multicomputers. ROMM is a class of Randomized, Oblivious, Multi--phase, Minimal routing algorithms. We will show that ROMM routing offers the potential for improved performance compared to both fully randomized algorithms and deterministic oblivious algorithms, under both light and heavy loads. ROMM routing also offers close to best--case performance for many common routing tasks. These claims are supported by extensive analysis and simulation of ROMM routing on several different interconnection network architectures, for a set of representative routing tasks. Furthermore, our results show that non--minimality and adaptivity, two common techniques for reducing congestion, are not always required for good routing performance.

TR-25-95 [tr-25-95.ps.gz (213 K)]
Victor J. Milenkovic and Karen M. Daniels. 1995. ``Translational Polygon Containment and Minimal Enclosure using Geometric Algorithms and Mathematical Programming.''
We present an algorithm for the two-dimensional translational {\em containment} problem: find translations for $k$ polygons (with up to $m$ vertices each) which place them inside a polygonal container (with $n$ vertices) without overlapping. The polygons and container may be nonconvex. The containment algorithm consists of new algorithms for {\em restriction}, {\em evaluation}, and {\em subdivision} of two-dimensional configuration spaces. The restriction and evaluation algorithms both depend heavily on linear programming; hence we call our algorithm an {\em LP containment algorithm}. Our LP containment algorithm is distinguished from previous containment algorithms by the way in which it applies principles of mathematical programming and also by its tight coupling of the evaluation and subdivision algorithms. Our new evaluation algorithm finds a local overlap minimum. Our distance-based subdivision algorithm eliminates a ``false'' (local but not global) overlap minimum and all layouts near that overlap minimum, allowing the algorithm to make progress towards the global overlap minimum with each subdivision.

In our experiments on data sets from the apparel industry, our LP algorithm can solve containment for up to ten polygons in a few minutes on a desktop workstation. Its practical running time is better than our previous containment algorithms and we believe it to be superior to all previous translational containment algorithms. Its theoretical running time, however, depends on the number of local minima visited, which is $O((6kmn+k^2m^2)^{2k+1}/k!)$. To obtain a better theoretical running time, we present a modified (combinatorial) version of LP containment with a running time of \[ O\left(\frac{(6kmn+k^2m^2)^{2k}}{(k-5)!} \log kmn \right), \] which is better than any previous combinatorial containment algorithm. For constant $k$, it is within a factor of $\log mn$ of the lower bound.

We generalize our configuration space containment approach to solve {\em minimal enclosure} problems. We give algorithms to find the minimal enclosing square and the minimal area enclosing rectangle for $k$ translating polygons. Our LP containment algorithm and our minimal enclosure algorithms succeed by combining rather than replacing geometric techniques with linear programming. This demonstrates the manner in which linear programming can greatly increase the power of geometric algorithms.
TR-26-95 [tr-26-95.ps.gz (60 K)]
J. Bradley Chen and Alan Eustace. 1995. ``Kernel Instrumentation Tools and Techniques.''
Atom is a powerful platform for the implementation of profiling, debugging and simulation tools. Kernel support in ATOM makes it possible to implement similar tools for the Digital UNIX kernel. We describe four non-trivial Atom kernel tools which demonstrate the support provided in Atom for kernel work as well as the range of application of Atom kernel tools. We go on to discuss some techniques that are generally useful when using Atom with the kernel. Prior techniques restrict kernel measurements to the domain of exotic systems research. We hope Atom technology will make kernel instrumentation and measurement practical for a much larger community of researchers.
TR-27-95 [tr-27-95.ps.gz (52 K)]
Amr F. Fahmy and Robert A. Wagner. 1995. ``On the Transportation and Distribution of Data Structures in Parallel and Distributed Systems.''
We present algorithms for the transportation of data in parallel and distributed systems that would enable programmers to transport or distribute a data structure by issuing a function call. Such a functionality is needed if programming distributed memory systems is to become commonplace.

The distribution problem is defined as follows. We assume that $n$ records of a data structure are scattered among $p$ processors where processor $q_i$ holds $r_{i}$ records, $1 \leq i \leq p$. The problem is to redistribute the records so that each processor holds $\lfloor n/p \rfloor$ records. We solve the problem in the minimum number of parallel data-permutation operations possible, for the given initial record distribution. This means that we use $\max(mxr - \lfloor n/p \rfloor, \lfloor n/p \rfloor - mnr)$ parallel data transfer steps, where $mxr = \max(r_{i})$ and $mnr = \min(r_{i})$ for $1 \leq i \leq p$.
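
The bound above is easy to check numerically. The sketch below (a naive greedy redistribution for illustration, not the paper's algorithm) computes the stated minimum and balances a small example by moving one record per donor processor per parallel step.

    def min_steps(r):
        """max(mxr - floor(n/p), floor(n/p) - mnr) for record counts r."""
        n, p = sum(r), len(r)
        target = n // p
        return max(max(r) - target, target - min(r))

    def redistribute(r):
        r, target, steps = list(r), sum(r) // len(r), 0
        while any(x > target for x in r) and any(x < target for x in r):
            # One parallel step: each over-full processor sends one record to
            # some under-full processor (the pairing here is arbitrary).
            donors = [i for i, x in enumerate(r) if x > target]
            receivers = [i for i, x in enumerate(r) if x < target]
            for d, v in zip(donors, receivers):
                r[d] -= 1
                r[v] += 1
            steps += 1
        return r, steps

    r = [9, 2, 5, 4]              # n = 20 records on p = 4 processors
    print(min_steps(r))           # max(9 - 5, 5 - 2) = 4
    print(redistribute(r))        # ([5, 5, 5, 5], 4)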

Having solved the distribution problem, it then remains to transport the data structure from the memory of one processor to another. In the case of dynamically allocated data structures, we solve the problem of renaming pointers by creating an intermediate name space. We also present a transportation algorithm that attempts to hide the cost of making a local copy of the data structure, which is necessary since the data structure could be scattered in the memory of the sender.

TR-28-95 [tr-28-95.ps.gz (117 K)]
Roni Khardon. 1995. ``Learning to take Actions.''
We formalize a model for supervised learning of action strategies in dynamic stochastic domains and show that PAC-learning results on Occam algorithms hold in this model as well. We then identify a class of rule based action strategies for which polynomial time learning is possible. The representation of strategies is a generalization of decision lists; strategies include rules with existentially quantified conditions, simple recursive predicates, and small internal state, but are syntactically restricted. We also study the learnability of hierarchically composed strategies where a subroutine already acquired can be used as a basic action in a higher level strategy. We prove some positive results in this setting, but also show that in some cases the hierarchical learning problem is computationally hard.
TR-29-95 [tr-29-95.ps.gz (629 K)]
S. Lennart Johnsson. 1995. ``Network Related Performance Issues and Techniques for MPPs.''
In this paper we review network related performance issues for current Massively Parallel Processors (MPPs) in the context of some important basic operations in scientific and engineering computation. The communication system is one of the most performance critical architectural components of MPPs. In particular, understanding the demand posed by collective communication is critical in architectural design and system software implementation. We discuss collective communication and some techniques for implementing it on electronic networks. Finally, we give an example of a novel general routing technique that exhibits good scalability, efficiency and simplicity in electronic networks.
TR-30-95 [tr-30-95.ps.gz (426 K)]
Scott Evan Decatur. 1995. ``Efficient Learning from Faulty Data.''
Learning systems are often provided with imperfect or noisy data. Therefore, researchers have formalized various models of learning with noisy data, and have attempted to delineate the boundaries of learnability in these models. In this thesis, we describe a general framework for the construction of efficient learning algorithms in noise tolerant variants of Valiant's PAC learning model. By applying this framework, we also obtain many new results for specific learning problems in various settings with faulty data.

The central tool used in this thesis is the specification of learning algorithms in Kearns' Statistical Query (SQ) learning model, in which statistics, as opposed to labelled examples, are requested by the learner. These SQ learning algorithms are then converted into PAC algorithms which tolerate various types of faulty data.

We develop this framework in three major parts:

  • We design automatic compilations of SQ algorithms into PAC algorithms which tolerate various types of data errors. These results include improvements to Kearns' classification noise compilation, and the first such compilations for malicious errors, attribute noise and new classes of ``hybrid'' noise composed of multiple noise types.
  • We prove nearly tight bounds on the required complexity of SQ algorithms. The upper bounds are based on a constructive technique which allows one to achieve this complexity even when it is not initially achieved by a given SQ algorithm.
  • We define and employ an improved model of SQ learning which yields noise tolerant PAC algorithms that are more efficient than those derived from standard SQ algorithms.
Together, these results provide a unified and intuitive framework for noise tolerant learning that allows the algorithm designer to achieve efficient, and often optimal, fault tolerant learning.
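
The flavor of the classification-noise compilation in the first item above can be illustrated with the simplest possible statistic (our toy example, not Kearns' general construction): if labels are flipped independently with known rate eta, then the observed frequency of positive labels satisfies p_obs = p(1 - eta) + (1 - p)eta, so the learner can recover p = (p_obs - eta) / (1 - 2 eta) from noisy data alone.

    import random

    def observed_positive_rate(p_true, eta, n_samples=200000):
        """Simulate labels that are flipped independently with probability eta."""
        positives = sum((random.random() < p_true) != (random.random() < eta)
                        for _ in range(n_samples))
        return positives / n_samples

    eta, p_true = 0.2, 0.35
    p_obs = observed_positive_rate(p_true, eta)
    p_corrected = (p_obs - eta) / (1 - 2 * eta)     # invert the noise process
    print(f"observed {p_obs:.3f}, corrected {p_corrected:.3f}, true {p_true}")
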
TR-31-95 [tr-31-95.ps.gz (61 K)]
Christopher Small and Margo Seltzer. 1995. ``Scheduler Activations on BSD: Sharing Thread Management Between Kernel and Application.''
There are two commonly used thread models: kernel level threads and user level threads. Kernel level threads suffer from the cost of frequent user-kernel domain crossings and fixed kernel scheduling priorities. User level threads are not integrated with the kernel, blocking all threads whenever one thread is blocked. The Scheduler Activations model, proposed by Anderson et al., combines kernel CPU allocation decisions with application control over thread scheduling. This paper discusses the performance characteristics of an implementation of Scheduler Activations for a uniprocessor BSD system, and proposes an analytic model for determining the class of applications that benefit from its use. Our implementation required fewer than two hundred lines of kernel code and provides an order of magnitude performance improvement over process-level facilities.
TR-32-95 [tr-32-95.ps.gz (176 K)]
Nadia Shalaby and S. Lennart Johnsson. 1995. ``A Vector Space Framework for Parallel Stable Permutations.''
We establish a formal foundation for stable permutations in the domain of a parallel model of computation applicable to a customized set of complexity metrics. By means of vector spaces, we develop an algebro-geometric representation that is expressive, flexible and simple to use, and present a taxonomy categorizing stable permutations into classes of index-digit, linear, translation, affine and polynomial permutations. For each class, we demonstrate its general behavioral properties and then analyze particular examples in each class, where we derive results about its inverse, fixed instances, number of instances local and nonlocal to a processor, as well as its compositional relationships to other permutations. Such examples are bit-reversal, radix-Q exchange, radix-Q shuffle and unshuffle within the index-digit class, radix-Q butterfly and 1's complement within the translation class, binary-to-Gray and Gray-to-binary within the linear class, and arithmetic add 1, arithmetic subtract 1 and 2's complement in the polynomial class. These were primarily chosen due to their importance in implementing orthogonal transforms, such as Cooley-Tukey Fast Fourier Transforms (FFT), real-to-complex (complex-to-real) FFT, and sine and cosine transforms on distributed memory parallel processing systems.
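
Three of the permutations named above have short, standard formulations on plain integer indices (shown here as an illustration of the objects being classified; the report's vector-space treatment is more general).

    def binary_to_gray(i):
        """Linear map over GF(2): i XOR (i >> 1)."""
        return i ^ (i >> 1)

    def gray_to_binary(g):
        """Inverse of the binary-to-Gray map."""
        b = 0
        while g:
            b ^= g
            g >>= 1
        return b

    def bit_reverse(i, bits):
        """Classic index-digit permutation used in FFT data reordering."""
        r = 0
        for _ in range(bits):
            r = (r << 1) | (i & 1)
            i >>= 1
        return r

    for i in range(8):                      # 3-bit indices
        g = binary_to_gray(i)
        assert gray_to_binary(g) == i
        print(i, g, bit_reverse(i, 3))
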
TR-35-95 [tr-35-95.ps.gz (54 K)]
S. Lennart Johnsson. 1995. ``Data Partitioning for Load--Balance and Communication Bandwidth Preservation.''
Preservation of locality of reference is the most critical issue for performance in high performance architectures, especially scalable architectures with electronic interconnection networks. We briefly discuss some of the issues in this paper and give results from the use of spectral bisection on irregular grid computations on the Connection Machine systems.
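
A minimal sketch of spectral bisection in its standard form (not necessarily the exact variant used for the Connection Machine experiments): split the vertices by the Fiedler vector, the eigenvector of the graph Laplacian associated with the second-smallest eigenvalue. This assumes numpy is available.

    import numpy as np

    def spectral_bisection(n, edges):
        """Partition vertices 0..n-1 into two halves using the Fiedler vector."""
        L = np.zeros((n, n))
        for u, v in edges:
            L[u, u] += 1
            L[v, v] += 1
            L[u, v] -= 1
            L[v, u] -= 1
        _, eigvecs = np.linalg.eigh(L)            # symmetric Laplacian
        fiedler = eigvecs[:, 1]                   # second-smallest eigenvalue
        cut = np.median(fiedler)
        left = [i for i in range(n) if fiedler[i] <= cut]
        right = [i for i in range(n) if fiedler[i] > cut]
        return left, right

    # A 2x4 grid: the bisection separates the two 2x2 halves, cutting two edges.
    edges = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7),
             (0, 4), (1, 5), (2, 6), (3, 7)]
    print(spectral_bisection(8, edges))
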
TR-36-95 [tr-36-95.ps.gz (170 K)]
Yu Hu, S. Lennart Johnsson, Dimitris Kehagias, and Nadia Shalaby. 1995. ``DPF: A Data Parallel Fortran Benchmark Suite.''
The Data Parallel Fortran (DPF) benchmark suite is designed for evaluating data parallel compilers and scalable architectures. Many of the DPF codes are provided in three versions: basic, optimized and with library calls for performance critical operations typically found in software libraries. The functionality of the benchmarks covers collective communication functions, scientific software library functions, and application kernels that reflect the computational structure and communication patterns in fluid dynamic simulations, fundamental physics and molecular studies in chemistry and biology. The intended target language is High Performance Fortran (HPF). However, due to the lack of HPF compilers at the time of this benchmark development, the suite is written in Connection Machine Fortran (CMF). The DPF codes provide performance evaluation metrics in the form of busy and elapsed times (CM-5), FLOP rates and arithmetic efficiency. The codes are characterized in terms of FLOP count, memory usage, communication pattern, local memory accesses, array allocation mechanism, as well as operation and communication counts per iteration.
TR-37-95 [tr-37-95.ps.gz (292 K)]
Robert Lee Walton. 1995. ``R-Code: a Very Capable Virtual Computer.''
This thesis investigates the design of a machine independent virtual computer, the R-CODE computer, for use as a target by high level language compilers. Unlike previous machine independent targets, R-CODE provides higher level capabilities, such as a garbage collecting memory manager, tagged data, type maps, array descriptors, register dataflow semantics, and a shared object memory. Emphasis is on trying to find universal versions of these high level features to promote interoperability of future programming languages and to suggest a migration path for future hardware.

The memory manager design combines both automatic garbage detection and an explicit ``manual'' delete operation. It permits objects to be copied at any time, to compact memory or expand objects. It traps obsolete addresses and instantly forwards copied objects using a software cache of an object map. It uses an optimized write-barrier, and is better suited for real-time than a standard copying collector.
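
One way to picture the address-forwarding behaviour (a very loose sketch invented for illustration; the names and structure are not the report's design): accesses go through a per-client software cache of object-map entries, and a stale cached address is detected and refreshed from the authoritative map, so objects can be copied or compacted at any time.

    class ObjectMap:
        """Authoritative map from object id to the object's current location."""
        def __init__(self):
            self.loc = {}                  # object id -> current storage (a list here)

        def allocate(self, oid, data):
            self.loc[oid] = list(data)

        def copy_object(self, oid):
            self.loc[oid] = list(self.loc[oid])    # move the object; update the map

    class AccessCache:
        """Per-client software cache of map entries; may hold stale addresses."""
        def __init__(self, omap):
            self.omap = omap
            self.cache = {}

        def read(self, oid, index):
            addr = self.cache.get(oid)
            if addr is None or addr is not self.omap.loc[oid]:   # trap stale address
                addr = self.omap.loc[oid]
                self.cache[oid] = addr                           # forward to new copy
            return addr[index]

    omap = ObjectMap()
    omap.allocate("obj", [10, 20, 30])
    view = AccessCache(omap)
    print(view.read("obj", 1))     # 20; caches the address
    omap.copy_object("obj")        # the compactor moves the object
    print(view.read("obj", 1))     # still 20: stale entry trapped and forwarded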

R-CODE investigates the design of type maps that extend the virtual function tables of C++ and similar tables of HASKELL, EIFFEL, and SATHER 0.6. R-CODE proposes to include numeric types and sizes in type maps, and to inline information from type maps by using dynamic case statements, which switch on a type map identifier. When confronted with a type map not seen before, a dynamic case statement compiles a new case of itself to handle the new type.

R-CODE also investigates using IEEE floating point signaling-NaNs as tagged data, and making array descriptors first class data.

R-CODE uses a new ``register dataflow'' execution flow model to better match the coming generation of superscalar processors. Functional dataflow is used for operations on register values, and memory operations are treated as unordered I/O. Barriers are introduced to sequence groups of unordered memory operations. A detailed semantic execution flow model is presented.

R-CODE includes a shared object memory design to support multi-threaded programming within a building where network shared object memory reads and writes take several thousand instruction-execution-times to complete. The design runs on existing symmetric processors, but requires special caches to run on future within-building systems.