The central tool used in this thesis is the specification of learning algorithms in Kearns' Statistical Query (SQ) learning model, in which the learner requests estimates of statistics over labelled examples rather than the labelled examples themselves. These SQ learning algorithms are then converted into PAC algorithms that tolerate various types of faulty data.

We develop this framework in three major parts:

- We design automatic compilations of SQ algorithms into PAC algorithms which tolerate various types of data errors. These results include improvements to Kearns' classification noise compilation, and the first such compilations for malicious errors, attribute noise and new classes of ``hybrid'' noise composed of multiple noise types.
- We prove nearly tight bounds on the required complexity of SQ algorithms. The upper bounds are based on a constructive technique which allows one to achieve this complexity even when it is not initially achieved by a given SQ algorithm.
- We define and employ an improved model of SQ learning which yields noise tolerant PAC algorithms that are more efficient than those derived from standard SQ algorithms.
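The classification-noise compilation rests on a simple identity: for a {0,1}-valued query, the noisy expectation is a fixed linear mixture of the clean statistic and its label-flipped counterpart, so the clean value can be recovered by inverting that mixture. The sketch below illustrates the idea under our own assumptions (the target concept, the query, and the noise rate are invented for the example; this is not the thesis's code):

```python
import random

random.seed(0)

def sq_oracle_from_noisy_data(chi, examples, eta):
    """Estimate the clean statistic E[chi(x, f(x))] using only examples whose
    labels were flipped independently with probability eta < 1/2.

    For {0,1}-valued chi:
      E_noisy[chi(x, l)] = (1 - eta) E[chi(x, f(x))] + eta E[chi(x, -f(x))],
    so querying chi both as-is and with the label flipped lets us invert
    the mixture: E[chi(x, f(x))] = ((1-eta) p - eta q) / (1 - 2 eta).
    """
    p = sum(chi(x, l) for x, l in examples) / len(examples)   # noisy estimate
    q = sum(chi(x, -l) for x, l in examples) / len(examples)  # label-flipped query
    return ((1 - eta) * p - eta * q) / (1 - 2 * eta)

# Hypothetical target concept over {-1,+1}^3: f(x) = first coordinate.
def f(x):
    return x[0]

eta = 0.2
xs = [tuple(random.choice([-1, 1]) for _ in range(3)) for _ in range(20000)]
examples = []
for x in xs:
    label = f(x) if random.random() > eta else -f(x)   # classification noise
    examples.append((x, label))

# Query: does the label agree with coordinate 0? Clean value is exactly 1.0.
chi = lambda x, l: 1 if x[0] == l else 0
estimate = sq_oracle_from_noisy_data(chi, examples, eta)
```

Despite 20% of the labels being flipped, the corrected estimate recovers the clean statistic up to sampling error, which is the mechanism that makes SQ algorithms compile into noise-tolerant PAC algorithms.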

\[ \min_Y \left\| \begin{bmatrix} XX' & XY' \\ YX' & YY' \end{bmatrix} - \begin{bmatrix} A & B \\ B' & C \end{bmatrix} \right\|_F^2 \] where A = XX', and B, C are matrices of inner products calculated from the estimated distances. \par The vanishing of the gradient of the STRAIN is shown to be equivalent to a system of only six nonlinear equations in six unknowns for the inertial tensor associated with the solution Y. The entire solution space is characterized in terms of the geometry of the intersection curves between the unit sphere and certain variable ellipsoids. Upon deriving tight bilateral bounds on the moments of inertia of any possible solution, we construct a search procedure that reliably locates the global minimum. The effectiveness of this method is demonstrated on realistic simulated and chemical test problems. %X tr-04-97.ps.gz %R TR-05-97 %D 1997 %T Development of a Systematic Approach to Bottleneck Identification in UNIX systems %A Lori Park %X tr-05-97.ps.gz %R TR-06-97 %D 1997 %T A Revisitation of Kernel Synchronization Schemes %A Christopher Small %A Stephen Manley %X In an operating system kernel, critical sections of code must be protected from interruption. This is traditionally accomplished by masking the set of interrupts whose handlers interfere with the correct operation of the critical section. Because it can be expensive to communicate with an off-chip interrupt controller, more complex optimistic techniques for masking interrupts have been proposed. \par In this paper we present measurements of the behavior of the NetBSD 1.2 kernel, and use the measurements to explore the space of kernel synchronization schemes. We show that (a) most critical sections are very short, (b) very few are ever interrupted, (c) using the traditional synchronization technique, the synchronization cost is often higher than the time spent in the body of the critical section, and (d) under heavy load NetBSD 1.2 can spend 9% to 12% of its time in synchronization primitives. 
\par The simplest scheme we examined, disabling all interrupts while in a critical section or interrupt handler, can lead to loss of data under heavy load. A more complex optimistic scheme functions correctly under the heavy workloads we tested and has very low overhead (at most 0.3%). Based on our measurements, we present a new model that offers the simplicity of the traditional scheme with the performance of the optimistic schemes. \par Given the relative CPU, memory, and device performance of today's hardware, the newer techniques we examined have a much lower synchronization cost than the traditional technique. Under heavy load, such as that incurred by a web server, a system using these newer techniques will have noticeably better performance. %X tr-06-97.ps.gz %R TR-07-97 %D 1997 %T An IRAM-Based Architecture for a Single-Chip ATM Switch %A A. Brown %A I. Papaefstathiou %A J. Simer %A D. Sobel %A J. Sutaria %A S. Wang %A T. Blackwell %A M. Smith %A W. Yang %X We have developed an architecture for an IRAM-based ATM switch that is implemented with merged DRAM and logic for a cost of about $100. The switch is based on a shared-buffer memory organization and is fully non-blocking. It can support a total aggregate throughput of 1.2 gigabytes per second, organized in any combination of up to thirty-two 155 Mb/sec, eight 622 Mb/sec, or four 1.2 Gb/sec full-duplex links. The switch can be fabricated on a single chip, and includes an internal 4 MB memory buffer capable of storing over 85,000 cells. When combined with external support circuitry, the switch is competitive with commercial offerings in its feature set, and significantly less expensive than existing solutions. 
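The quoted capacity figures are mutually consistent, which a quick back-of-envelope check confirms. The assumptions here are ours, not the report's: that "aggregate" counts both directions of each full-duplex link, and that the buffer stores 48-byte ATM cell payloads.

```python
MB = 10**6          # link rates use decimal megabits

def aggregate_GBps(n_links, rate_mbps):
    """Aggregate throughput in GB/s, counting both duplex directions."""
    bits = n_links * rate_mbps * MB * 2   # x2: full duplex
    return bits / 8 / 10**9

for links, rate in [(32, 155), (8, 622), (4, 1200)]:
    print(links, "x", rate, "Mb/s ->", round(aggregate_GBps(links, rate), 2), "GB/s")

# Buffer capacity: 4 MB divided by a 48-byte cell payload.
cells = (4 * 2**20) // 48
```

Each configuration lands at roughly 1.2 GB/s, and a 4 MB buffer holds about 87,000 48-byte payloads, matching the "over 85,000 cells" claim.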
The switch is targeted to WAN infrastructure applications such as wide-area Internet access, data backbones, and digital telephony, where we feel untapped markets exist, but it is also usable for ATM-based LANs and even could be modified to penetrate the potentially lucrative Fast and Gigabit Ethernet markets. %X tr-07-97.ps.gz %R TR-08-97 %D 1997 %T Evaluation of Two Connectionist Approaches to Stack Representation %A Rebecca Hwa %X This study empirically compares two distributed connectionist learning models trained to represent an arbitrarily deep stack. One is Pollack's Recursive Auto-Associative Memory, a recurrent, back-propagating neural network that uses a hidden intermediate representation. The other is the Exponential Decay Model, a novel framework we propose here, that trains the network to explicitly model the stack as an exponentially decaying entity. We show that although the concept of a stack is learnable for both approaches, neither is able to deliver the arbitrary depth attribute. Ultimately, both suffer from the rapid rate of error propagation inherent in their recursive structures. %X tr-08-97.ps.gz %R TR-09-97 %D 1997 %T Learning Action Strategies for Planning Domains %A Roni Khardon %X This paper reports on experiments where techniques of supervised machine learning are applied to the problem of planning. The input to the learning algorithm is composed of a description of a planning domain, planning problems in this domain, and solutions for them. The output is an efficient algorithm - a strategy - for solving problems in that domain. We test the strategy on an independent set of planning problems from the same domain, so that success is measured by its ability to solve complete problems. A system, L2Act, has been developed in order to perform these experiments. 
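The strategies L2Act learns generalize Rivest's decision lists. A minimal single-literal variant conveys the learning procedure: greedily find a rule consistent with the remaining examples, append it, and discard the examples it covers. This sketch is ours and deliberately omits the existentially quantified first-order rules the actual system uses; the toy data is invented.

```python
from itertools import product

def learn_decision_list(examples):
    """Greedy construction of a decision list consistent with the data, in
    the spirit of Rivest (1987). Each rule tests one literal (feature index,
    required value) and emits a label; returns None if no consistent list
    over single literals exists."""
    examples = list(examples)
    n = len(examples[0][0])
    dl = []
    while examples:
        for i, v, label in product(range(n), [0, 1], [0, 1]):
            covered = [(x, y) for x, y in examples if x[i] == v]
            if covered and all(y == label for _, y in covered):
                dl.append((i, v, label))
                examples = [(x, y) for x, y in examples if x[i] != v]
                break
        else:
            return None
    return dl

def predict(dl, x):
    for i, v, label in dl:
        if x[i] == v:
            return label
    return 0   # fell off the end: arbitrary default

# Toy concept: label = x0 OR x1, which is expressible as a decision list.
data = [(x, 1 if (x[0] or x[1]) else 0) for x in product([0, 1], repeat=3)]
dl = learn_decision_list(data)
```

The learned list classifies all training examples correctly; the full system's improvements are about doing this search efficiently over far richer first-order rules.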
\par We have experimented with the blocks world domain and the logistics domain, using strategies in the form of a generalization of decision lists, where the rules on the list are existentially quantified first order expressions. The learning algorithm is a variant of Rivest's (1987) algorithm, improved with several techniques that reduce its time complexity. As the experiments demonstrate, generalization is achieved so that large unseen problems can be solved by the learned strategies. The learned strategies are efficient and are shown to find solutions of high quality. We also discuss preliminary experiments with linear threshold algorithms for these problems. %X tr-09-97.ps.gz %R TR-10-97 %D 1997 %T L2Act User Manual %A Roni Khardon %X This note describes the system L2Act, the options it includes, and how to use it. We assume knowledge of the general ideas behind the system, as well as some details on the implementation described in TR-28-95 and TR-09-97. %X tr-10-97.ps.gz %R TR-11-97 %D 1997 %T Similarity-Based Approaches to Natural Language Processing %A Lillian Jane Lee %X Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the {\em sparse data problem}. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about similar events. 
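Similarity between events can be made concrete by comparing the distributions they induce, for instance the distribution of words that co-occur with each word. A small sketch using the Kullback-Leibler divergence (the similarity measure adopted below); the co-occurrence counts are invented for illustration:

```python
from math import log

def kl_divergence(p, q, smooth=1e-6):
    """Kullback-Leibler divergence D(p || q) between discrete distributions
    given as count dicts; tiny additive smoothing keeps every q(y) > 0."""
    keys = set(p) | set(q)
    z_p = sum(p.values()) + smooth * len(keys)
    z_q = sum(q.values()) + smooth * len(keys)
    total = 0.0
    for y in keys:
        pi = (p.get(y, 0.0) + smooth) / z_p
        qi = (q.get(y, 0.0) + smooth) / z_q
        total += pi * log(pi / qi)
    return total

# Toy co-occurrence counts: objects observed after each verb.
usage = {
    "drink": {"water": 8, "tea": 5, "wine": 2},
    "sip":   {"tea": 6, "water": 3, "wine": 1},
    "drive": {"car": 9, "truck": 4},
}

def nearest(word):
    """Most similar other word under the (asymmetric) KL divergence."""
    return min((w for w in usage if w != word),
               key=lambda w: kl_divergence(usage[word], usage[w]))
```

Because "drink" and "sip" induce overlapping object distributions while "drive" does not, the divergence ranks "sip" closest, and sparse statistics for one word could then be estimated from its neighbors.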
This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity. \par Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing, represents (to our knowledge) the first application of soft clustering to problems in natural language processing. We use this method to cluster words drawn from 44 million words of Associated Press Newswire and 10 million words from Grolier's encyclopedia, and find that language models built from the clusters have substantial predictive power. Our algorithm also extends with no modification to other domains, such as document clustering. \par Our second approach is a nearest-neighbor approach: instead of calculating a centroid for each class, we in essence build a cluster around each word. We compare several such nearest-neighbor approaches on a word sense disambiguation task and find that as a whole, their performance is far superior to that of standard methods. In another set of experiments, we show that using estimation techniques based on the nearest-neighbor model enables us to achieve perplexity reductions of more than 20 percent over standard techniques in the prediction of low-frequency events, and statistically significant speech recognition error-rate reduction. %X tr-11-97.ps.gz %R TR-12-97 %D 1997 %T An Analysis of Issues Facing World Wide Web Servers %A Stephen Lee Manley %X The World Wide Web has captured the public's interest like no other computer application or tool. In response, businesses have attempted to capitalize on the Web's popularity. As a result, propaganda, assumption, and unfounded theories have taken the place of facts, scientific analysis, and well-reasoned theories. 
As with all things, the Web's popularity comes with a price: for the first time, the computer industry must satisfy exponentially increasing demand. As the World Wide Web becomes the "World Wide Wait" and the Internet changes from the "Information Superhighway" to a "Giant Traffic Jam," the public demands improvements in Web performance. \par The lack of cogent scientific analysis prevents true improvement in Web conditions. Nobody knows the source of the bottlenecks. Some assert that the server must be the problem. Others blame Internet congestion. Still others place the blame on modems or slow Local Area Networks. The Web's massive size and growth have made research difficult, but those same factors make such work indispensable. \par This thesis examines issues facing the Web by focusing on traffic patterns on a variety of servers. The thesis presents a method of categorizing different Web site growth patterns. It then disproves the theory that CGI has become an important and varied tool on most Web sites. Most importantly, however, the thesis focuses on the source of latency on the Web. An in-depth examination of the data leads to the conclusion that the server cannot be a primary source of latency on the World Wide Web. \par The thesis then details the creation of a new realistic, self-configuring, scaling Web server benchmark. By using a site's Web server logs, the benchmark can create a model of the site's traffic. The model can be reduced by a series of abstractions, and scaled to predict future behavior. Finally, the thesis shows that the benchmark models realistic Web server traffic, and can serve as a tool for scientific analysis of developments on the Web, and their effects on the server. %X tr-12-97.ps.gz %R TR-13-97 %D 1997 %T Data Parallel Performance Optimizations Using Array Aliasing %A Y. Charlie Hu %A S. 
Lennart Johnsson %X The array aliasing mechanism provided in the Connection Machine Fortran (CMF) language and run--time system provides a unique way of identifying the memory address spaces local to processors within the global address space of distributed memory architectures, while staying in the data parallel programming paradigm. We show how the array aliasing feature can be used effectively in optimizing communication and computation performance. The constructs we present occur frequently in many scientific and engineering applications, and include various forms of aggregation and array reshaping through array aliasing. The effectiveness of the optimization techniques is demonstrated on an implementation of Anderson's hierarchical $O(N)$ $N$--body method. %X tr-13-97.ps.gz %R TR-14-97 %D 1997 %T Efficient Data Parallel Implementations of Highly Irregular Problems %A Yu Hu %X This dissertation presents optimization techniques for efficient data parallel formulation/implementation of highly irregular problems, and applies the techniques to $O(N)$ hierarchical \Nbody methods for large--scale \Nbody simulations. It demonstrates that highly irregular scientific and engineering problems such as nonadaptive and adaptive $O(N)$ hierarchical \Nbody methods can be efficiently implemented in high--level data parallel languages such as High Performance Fortran (HPF) on scalable parallel architectures. It also presents an empirical study of the accuracy--cost tradeoffs of $O(N)$ hierarchical \Nbody methods. \par This dissertation first develops optimization techniques for efficient data parallel implementation of irregular problems, focusing on minimizing the data movement through careful management of the data distribution and the data references, both between the memories of different nodes, and within the memory hierarchy of each node. 
For hierarchical \Nbody methods, our optimizations for improving arithmetic efficiency include recognizing dominating computations as matrix--vector multiplications and aggregating them into multiple--instance matrix--matrix multiplications. Experimental results with an implementation in Connection Machine Fortran of Anderson's hierarchical \Nbody method demonstrate that performance competitive with that of the best message--passing implementations of the same class of methods can be achieved. \par The dissertation also presents a general data parallel formulation for highly irregular applications, and applies the formulation to an adaptive hierarchical \Nbody method with highly nonuniform particle distributions. The formulation consists of (1) a method for linearizing irregular data structures, (2) a data parallel implementation (in HPF) of graph partitioning algorithms applied to the linearized data structure, and (3) techniques for expressing irregular communications and nonuniform computations associated with the elements of linearized data structures. Experimental results demonstrate that efficient data parallel (HPF) implementations of highly nonuniform problems are feasible with proper language/compiler/runtime support. Our data parallel $N$--body code provides a much needed ``benchmark'' code for evaluating and improving HPF compilers. \par This thesis also develops the first data parallel (HPF) implementation of the geometric partitioning algorithm due to Miller, Teng, Thurston and Vavasis -- one of the only two provably good partitioning schemes. Our data parallel formulation makes extensive use of segmented prefix sums and parallel selections, and provides a data parallel procedure for geometric sampling. Experiments on partitioning particles for load--balance and data interactions as required in hierarchical \Nbody algorithms show that the geometric partitioning algorithm has an efficient data parallel formulation. 
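The segmented prefix sum used extensively in that formulation has simple semantics: an inclusive scan that restarts at each segment boundary. A sequential sketch (data parallel versions compute the same result in logarithmically many steps with a scan primitive):

```python
def segmented_prefix_sum(values, flags):
    """Inclusive prefix sum that restarts at each segment head.
    flags[i] == 1 marks the start of a new segment."""
    out = []
    running = 0
    for v, f in zip(values, flags):
        running = v if f else running + v   # restart at segment heads
        out.append(running)
    return out

# Three segments: [1, 2, 3 | 10, 20 | 5]
result = segmented_prefix_sum([1, 2, 3, 10, 20, 5], [1, 0, 0, 1, 0, 1])
```

Keeping many independent segments inside one flat scan is what lets irregular, per-subtree computations be expressed as a single data parallel operation.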
\par Finally, this thesis studies the accuracy--cost tradeoffs of $O(N)$ hierarchical \Nbody methods using our implementation of nonadaptive Anderson's method. The various parameters that control the degree of approximation of the computational elements and the separateness of the interacting elements govern both the arithmetic complexity and the accuracy of the methods. A scheme for choosing optimal parameters that give the best running time for a prescribed error requirement is developed. Using this scheme, we find that for a prescribed error, a near--field containing only nearest neighbor boxes, combined with the hierarchy depth that minimizes the number of arithmetic operations, also minimizes the total running time. %X tr-14-97.ps.gz %R TR-15-97 %D 1997 %T The Computational Processing of Intonational Prominence: A Functional Prosody Perspective %A Christine Hisayo Nakatani %X tr-15-97.ps.gz %R TR-16-97 %D 1997 %T Does Systems Research Measure Up? %A Christopher Small %A Narendra Ghosh %A Hany Saleeb %A Margo Seltzer %A Keith Smith %X We surveyed more than two hundred systems research papers published in the last six years, and found that, in experiment after experiment, systems researchers measure the same things, but in the majority of cases the reported results are not reproducible, comparable, or statistically rigorous. In this paper we present data describing the state of systems experimentation and suggest guidelines for structuring commonly run experiments, so that results from work by different researchers can be compared more easily. We conclude with recommendations on how to improve the rigor of published computer systems research. 
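One concrete step toward the statistical rigor called for above is reporting a confidence interval over repeated runs rather than a single number. A minimal sketch; the run times are invented, and the Student-t critical value is passed in explicitly (2.262 is the standard two-sided 95% value for 9 degrees of freedom) to avoid depending on a stats library:

```python
from statistics import mean, stdev

def confidence_interval_95(samples, t_crit):
    """Mean and 95% confidence half-width for repeated benchmark runs.
    t_crit is the two-sided Student-t critical value for len(samples) - 1
    degrees of freedom."""
    m = mean(samples)
    half = t_crit * stdev(samples) / len(samples) ** 0.5
    return m, half

# Ten hypothetical runs of one experiment (seconds).
runs = [4.1, 4.3, 3.9, 4.2, 4.0, 4.4, 4.1, 4.2, 4.0, 4.3]
m, half = confidence_interval_95(runs, t_crit=2.262)
```

Reporting "4.15 s ± 0.11 s (95% CI)" lets other researchers judge whether a rival result is actually distinguishable from this one.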
%X tr-16-97.ps.gz %R TR-17-97 %D 1997 %T A Self-Scaling and Self-Configuring Benchmark for Web Servers %A Stephen Manley, Network Appliance %A Michael Courage, Microsoft Corporation %A Margo Seltzer, Harvard University %X World Wide Web clients and servers have become some of the most important applications in our computing base today, and as such, we need realistic and meaningful ways of capturing their performance. Current server benchmarks do not capture the wide variation that we see in servers and are not accurate in their characterization of web traffic. In this paper, we present a self-configuring, scalable benchmark that generates a server benchmark load based on actual server loads. In contrast to other web benchmarks, our benchmark characterizes request latency instead of focusing exclusively on throughput-sensitive metrics. We present our new benchmark, hbench:Web, and demonstrate how it accurately captures the load observed by an actual server. We then go on to show how it can be used to assess how continued growth or changes in the workload will affect future performance. Using existing log histories, we show that these predictions are sufficiently realistic to provide insight into tomorrow's web performance. %X tr-17-97.ps.gz %R TR-18-97 %D 1997 %T Issues in Extensible Operating Systems %A Margo I. Seltzer %A Yasuhiro Endo %A Christopher Small %A Keith A. Smith %X Operating systems research has traditionally consisted of adding functionality to the operating system or inventing and evaluating new methods for performing functions. Regardless of the research goal, the single constant has been that the size and complexity of operating systems increase over time. As a result, operating systems are usually the single most complex piece of software in a computer system, containing hundreds of thousands, if not millions, of lines of code. 
Today's operating system research is directed at finding new ways to structure the operating system in order to increase its flexibility, allowing it to adapt to changes in the application set it must support. This paper discusses the issues involved in designing such extensible systems and the array of choices facing the operating system designer. We present a framework for describing extensible operating systems and then relate current operating systems to this framework. %X tr-18-97.ps.gz %R TR-19-97 %D 1997 %T Projection Learning %A Leslie G. Valiant %X A method of combining learning algorithms is described that preserves attribute efficiency. It yields learning algorithms that require a number of examples that is polynomial in the number of relevant variables and logarithmic in the number of irrelevant ones. The algorithms are simple to implement and realizable on networks with a number of nodes linear in the total number of variables. They can be viewed as strict generalizations of Littlestone's Winnow algorithm, and, therefore, appropriate to domains having very large numbers of attributes, but where nonlinear hypotheses are sought. %X tr-19-97.ps.gz %R TR-20-97 %D 1997 %T The Mug-Shot Search Problem %A Ellie Baker %A Margo Seltzer %X Mug-shot search is the classic example of the general problem of searching a large facial image database when starting out with only a mental image of the sought-after face. We have implemented a prototype content-based image retrieval system that integrates composite face creation methods with a face-recognition technique (Eigenfaces) so that a user can both create faces and search for them automatically in a database. These two functions are fully integrated so that interim created composites may be used to search the data and interim search results may, likewise, be used to modify an evolving composite. 
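The eigenface technique underlying that search reduces each face to coordinates along the principal components of the training faces, then compares faces in that low-dimensional space. A toy sketch under our own assumptions (4-"pixel" faces, a single principal component found by power iteration, invented data; the real method works on full images with many components):

```python
def power_iteration(cov, iters=200):
    """Leading eigenvector of a small symmetric matrix (list of lists)."""
    n = len(cov)
    v = [float(i + 1) for i in range(n)]   # non-symmetric start vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def eigenface_coords(faces, queries):
    """Project faces and queries onto the leading 'eigenface' (the top
    principal component of the training faces); returns 1-D coordinates."""
    n = len(faces[0])
    mean = [sum(f[i] for f in faces) / len(faces) for i in range(n)]
    centered = [[f[i] - mean[i] for i in range(n)] for f in faces]
    cov = [[sum(c[i] * c[j] for c in centered) for j in range(n)]
           for i in range(n)]
    e = power_iteration(cov)
    proj = lambda f: sum((f[i] - mean[i]) * e[i] for i in range(n))
    return [proj(f) for f in faces], [proj(q) for q in queries]

# Tiny 4-pixel "faces": two bright-left, two bright-right.
db = [[9, 8, 1, 0], [8, 9, 0, 1], [1, 0, 9, 8], [0, 1, 8, 9]]
query = [[9, 9, 0, 0]]   # a composite resembling the bright-left faces
db_coords, (q,) = eigenface_coords(db, query)
best = min(range(len(db)), key=lambda i: abs(db_coords[i] - q))
```

A user-built composite, once projected, can be matched against the database by nearest coordinate, which is exactly how composite creation and automatic search interlock in the prototype described above.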
\par Although the Eigenface method has been studied extensively for its ability to perform face identification tasks (in which the input to the system is an on-line facial image to identify), little research has been done to determine how effective it is as applied to the mug shot search problem (in which there is no on-line input image at the outset). With our prototype system, we have conducted a pilot user study that looks at the usefulness of eigenfaces as applied to this problem. The study shows that the eigenface method, though helpful, is an imperfect model of human perception of similarity between faces. Using a novel evaluation methodology, we have made progress at identifying specific search strategies that, given an imperfect correlation between the system and human similarity metrics, use whatever correlation does exist to the best advantage. We have also shown that the use of facial composites as query images is advantageous compared to restricting users to database images for their queries. %X tr-20-97.ps.gz %R TR-21-97 %D 1997 %T Quality and Speed in Linear-scan Register Allocation %A Omri Traub %A Glenn Holloway %A Michael Smith %X A linear-scan algorithm directs the global allocation of register candidates to registers based on a simple linear sweep over the program being compiled. This approach to register allocation makes sense for systems, such as those for dynamic compilation, where compilation speed is important. In contrast, most commercial and research optimizing compilers rely on a graph-coloring approach to global register allocation. In this paper, we compare the performance of a linear-scan method against a modern graph-coloring method. We implement both register allocators within the Machine SUIF extension of the Stanford SUIF compiler system. Experimental results show that linear scan is much faster than coloring on benchmarks with large numbers of register candidates. 
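The core of the linear-scan idea fits in a few lines: walk live intervals in order of start point, recycle registers whose intervals have expired, and spill the interval that ends furthest away when registers run out. This is a minimal sketch in the style of Poletto and Sarkar's formulation, not the Machine SUIF implementation compared above; the intervals are invented.

```python
def linear_scan(intervals, num_regs):
    """Linear-scan register allocation over (name, start, end) live
    intervals. Returns ({name: register}, [spilled names]). Active
    intervals are kept sorted by end point; on pressure, the interval
    with the furthest end point is spilled."""
    intervals = sorted(intervals, key=lambda t: t[1])   # sweep by start
    free = list(range(num_regs))
    active = []                      # (end, name, reg), sorted by end
    assign, spilled = {}, []
    for name, start, end in intervals:
        while active and active[0][0] <= start:   # expire old intervals
            _, _, reg = active.pop(0)
            free.append(reg)
        if free:
            reg = free.pop(0)
            assign[name] = reg
            active.append((end, name, reg))
            active.sort()
        elif active[-1][0] > end:    # steal from the furthest-ending interval
            _, victim, reg = active.pop()
            spilled.append(victim)
            del assign[victim]
            assign[name] = reg
            active.append((end, name, reg))
            active.sort()
        else:
            spilled.append(name)     # current interval ends last: spill it
    return assign, spilled

# v1 overlaps everything; v2 ends before v4 starts, so they can share.
assign, spilled = linear_scan(
    [("v1", 0, 10), ("v2", 1, 4), ("v3", 2, 9), ("v4", 5, 8)], num_regs=2)
```

With two registers, the long-lived `v1` is spilled while `v2` and `v4` share a register across disjoint lifetimes; the single sweep, with no interference graph, is what makes the approach attractive for dynamic compilation.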
We also describe improvements to the linear-scan approach that do not change its linear character, but allow it to produce code of a quality near to that produced by graph coloring. %X tr-21-97.ps.gz %R TR-22-97 %D 1997 %T The Evolution of SharedPlans %A Barbara J. Grosz %A Sarit Kraus %X tr-22-97.ps.gz %R TR-23-97 %D 1997 %T Computer-Human Collaboration in the Design of Graphics %A Kathleen Ryall %X Delineating the roles of the user and the computer in a system is a central task in user interface design. As interactive applications become more complex, it is increasingly difficult to design interface methods that deliver the full power of an application to users, while enabling them to learn and to use effectively the interface to a system. The challenge is finding a balance between user intervention and computer control within the computer-user interface. \par In this thesis, we propose a new paradigm for determining this division of labor, which attempts to exploit the strengths of each collaborator, the human user and the computer system. This collaborative framework encourages the development of semi-automatic systems, through which users can explore a large number of candidate solutions, while evaluating and comparing various alternatives. Under the collaborative framework, the problem to be solved is framed as an optimization problem, which is then decomposed into local and global portions. The user is responsible for global aspects of the problem: placing the computer into different areas of the search space, and determining when an acceptable solution has been reached. The computer works at the local level, computing the local minima, displaying results to the user, and providing simple interface mechanisms to facilitate the interaction. 
Systems employing this approach make use of task-specific information to leverage the actions of users, handling fine-grained details automatically while leaving the high-level aspects to the user, who specifies them through coarse interface gestures. \par We present applications of our collaborative paradigm to the design and implementation of semi-automatic systems for three tasks from the domain of graphic design: network diagram layout, parameter specification for computer graphics algorithms, and floor plan segmentation. The collaborative paradigm we propose is well-suited for this domain. Systems designed under our framework support an iterative design process, an integral component of graphic design. Furthermore, the collaborative framework for computer-user interface design exploits people's expertise at incorporating aesthetic criteria and semantic information into finding an acceptable solution for a graphic design task, while harnessing a computer's computational power, to enable users to explore a large space of candidate solutions. %X tr-23-97.ps.gz %R tr-24-97 %D 1997 %T Fast Parallel Orthogonal Transforms %A Nadia Shalaby %X Orthogonal transforms are ubiquitous in engineering and applied science applications. Such applications place stringent requirements on a computer system, either in terms of time, which translates into processing power, or problem size, which translates into memory size, or most often both. To achieve such high performance, these applications have been migrating to the realm of parallel computing, offering significantly more processing power and memory. This trend mandates the development of fast parallel algorithms for orthogonal transforms that are versatile on many parallel platforms, which is the subject of this thesis. 
\par We present a unified approach in seeking fast parallel orthogonal transforms by establishing a theoretical formulation for each algorithm, presenting a simple specification to facilitate implementation and analysis, and by evaluating performance via a set of predefined arithmetic, memory, communication and load--imbalance complexity metrics. We adopt a bottom--up approach by constructing the thesis in the form of progressively larger modular blocks, each relying on a lower level block for its formulation, algorithmic specification and performance analysis. \par Since communication is the bottleneck for parallel computing, we first establish an algebraic framework for stable parallel permutations, the predominant communication requirement for parallel orthogonal transforms. We present a taxonomy categorizing stable permutations into classes of index--digit, linear, translation, affine and polynomial permutations. For each class, we demonstrate its general behavioral properties and prove permutation locality and other properties for particular examples. \par These results are then applied in formulating, specifying and evaluating performance of the direct and load--balanced parallel algorithms for the fast Fourier transform (FFT); we prove the latter to be optimal. We demonstrate the versatility of our complexity metrics by substituting them into the PRAM, LPRAM, BSP and LogP computational models. Consequently, we employ the load--balanced FFT in deriving a novel polynomial--based discrete cosine transform (DCT) and demonstrate its performance advantage over the classical DCT. Finally, the polynomial DCT is used as a building block for a novel parallel fast Legendre transform (FLT), a parallelization of the Driscoll--Healey $O(N \log^2 N)$ method. We show that the algorithm is hierarchically load--balanced by composing its specification and complexity from its modular blocks. 
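The canonical member of the index--digit permutation class is bit reversal, the data movement required between the stages of an in-place radix-2 FFT. A small sketch of the permutation itself (sequential here; the thesis's concern is realizing such permutations with low communication cost in parallel):

```python
def bit_reverse_permutation(xs):
    """Reorder xs (length a power of two) by reversing the bits of each
    index: a classic index-digit permutation."""
    n = len(xs)
    bits = n.bit_length() - 1
    out = [None] * n
    for i in range(n):
        j = int(format(i, f"0{bits}b")[::-1], 2)   # reverse the bit string
        out[j] = xs[i]
    return out

perm = bit_reverse_permutation(list(range(8)))
```

Because digit reversal is an involution, applying it twice restores the original order, a property a parallel implementation can exploit to reuse one communication schedule for both directions.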
%X tr-24-97.ps.gz %R TR-01-98 %D 1998 %T A Seed-Growth Heuristic for Graph Bisection %A Joe Marks %A Wheeler Ruml %A Stuart M. Shieber %A J. Thomas Ngo %X We present a new heuristic algorithm for graph bisection, based on an implicit notion of clustering. We describe how the heuristic can be combined with stochastic search procedures and a postprocess application of the Kernighan-Lin algorithm. In a series of time-equated comparisons with large-sample runs of pure Kernighan-Lin, the new algorithm demonstrates significant superiority in terms of the best bisections found. %X tr-01-98.ps.gz %R TR-02-98 %D 1998 %T Activity Graph %A Michael Karr %X This paper discusses activity graphs, the mathematical formalism underlying the Activity Coordination System, a process-centered system for collaborative work. The activity graph model was developed for the purpose of describing, tracking, and guiding distributed, communication-intensive activities. Its graph-based semantics encompasses the more familiar Petri nets, but has several novel properties. For example, it is possible to impose multiple hierarchies on the same graph, so that the hierarchy with which an activity is described does not have to be the one with which it is viewed. The paper also discusses very briefly some aspects of the system implementation. %X tr-02-98.ps.gz %R TR-03-98 %D 1998 %T An Activity Coordination System %A Michael Karr %A Thomas E. Cheatham, Jr. %X tr-03-98.ps.gz %R TR-04-98 %D 1998 %T A Neuroidal Architecture for Cognitive Computation %A Leslie G. Valiant %X An architecture is described for designing systems that acquire and manipulate large amounts of unsystematized, or so-called commonsense, knowledge. Its aim is to exploit to the full those aspects of computational learning that are known to offer powerful solutions in the acquisition and maintenance of robust knowledge bases. 
The architecture makes explicit the requirements on the basic computational tasks that are to be performed and is designed to make these computationally tractable even for very large databases. The main claims are that (i) the basic learning tasks are tractable and (ii) tractable learning offers viable approaches to a range of issues that have been previously identified as problematic for artificial intelligence systems that are entirely programmed. In particular, attribute efficiency holds a central place in the definition of the learning tasks, as does also the capability to handle relational information efficiently. Among the issues that learning offers to resolve are robustness to inconsistencies, robustness to incomplete information and resolving among alternatives. %X tr-04-98.ps.gz %R TR-05-98 %D 1998 %T E-L Rational %A Glenn Holloway %A Steve Squires %X REPORT WITHDRAWN %R TR-06-98 %D 1998 %T An Empirical Evaluation of Probabilistic Lexicalized Tree Insertion Grammars %A Rebecca Hwa %X We present an empirical study of the applicability of Probabilistic Lexicalized Tree Insertion Grammars (PLTIG), a lexicalized counterpart to Probabilistic Context-Free Grammars (PCFG), to problems in stochastic natural-language processing. Comparing the performance of PLTIGs with non-hierarchical $N$-gram models and PCFGs, we show that PLTIG combines the best aspects of both, with language modeling capability comparable to $N$-grams, and improved parsing performance over its non-lexicalized counterpart. Furthermore, training of PLTIGs displays faster convergence than PCFGs. %X tr-06-98.ps.gz %R TR-07-98 %D 1998 %T Parsing Inside-Out %A Joshua Goodman %X The inside-outside probabilities are typically used for reestimating Probabilistic Context Free Grammars (PCFGs), just as the forward-backward probabilities are typically used for reestimating HMMs. 
In this thesis, I show several novel uses, including improving parser accuracy by matching parsing algorithms to evaluation criteria; speeding up DOP parsing by a factor of 500; and speeding up PCFG thresholding by a factor of 30 at a given accuracy level. I also give an elegant, state-of-the-art grammar formalism, which can be used to compute inside-outside probabilities; and a parser description formalism, which makes it easy to derive inside-outside formulas and many others. %X tr-07-98.ps.gz %R TR-08-98 - This report has been withdrawn %D 1998 %R TR-09-98 %D 1998 %T A Statistical Analysis of User-Specific Profiles %A Zheng Wang %A Norm Rubin %X This technical report examines common assumptions about computer users in profile-based optimization. We study execution profiles of interactive applications on Windows NT to understand how different users use the same program. The profiles were generated by the DIGITAL FX!32 emulator/binary translator system, which automatically runs the x86 version of Windows NT programs on NT/Alpha computers. Our statistical analysis indicates that people use the benchmark programs in different ways. This technical report is a supplement to the paper "Evaluating the Importance of User-Specific Profiling," to appear in "Proceedings of the 2nd USENIX Windows NT Symposium," USENIX Association, August 1998. %X tr-09-98.ps.gz %R TR-10-98 %D 1998 %T An Empirical Study of Smoothing Techniques for Language Modeling %A Stanley F. Chen %A Joshua Goodman %X We present a tutorial introduction to $n$-gram models for language modeling and survey the most widely-used smoothing algorithms for such models. We then present an extensive empirical comparison of several of these smoothing techniques, including those described by Jelinek and Mercer (1980), Katz (1987), Bell, Cleary, and Witten (1990), Ney, Essen, and Kneser (1994), and Kneser and Ney (1995).
We investigate how factors such as training data size, training corpus (e.g., Brown versus Wall Street Journal), count cutoffs, and n-gram order (bigram versus trigram) affect the relative performance of these methods, which is measured through the cross-entropy of test data. Our results show that previous comparisons have not been complete enough to fully characterize smoothing algorithm performance. We introduce methodologies for analyzing smoothing algorithm efficacy in detail, and using these techniques we motivate a novel variation of Kneser-Ney smoothing that consistently outperforms all other algorithms evaluated. Finally, we present results showing that improved language model smoothing leads to improved speech recognition performance. %X tr-10-98.ps.gz %R TR-11-98 %D 1998 %T Expectation Value of the Lowest of a Set of Randomly Selected Integers %A Adolph Baker %A Ellen Baker %X Consider the set of nonnegative integers 0, 1, 2, ..., D. If we pick N of them at random, where N < (D+1), what is the expectation (or average value) of the lowest-valued of the N picks? We briefly describe the image database search question that gave rise to this problem, and present a proof that the answer is (D-N+1)/(N+1). %X tr-11-98.ps.gz %R TR-12-98 %D 1998 %T CS50 1997 Quantitative Study %A Dan Ellard %X A discussion and analysis of a quantitative study of CS50, Harvard's introductory Computer Science course. It describes a system for passively monitoring and gathering data about the use of various programming utilities by students, gives an overview of the data gathered with this system, and an analysis of the relationship between student work habits, other factors, and success in CS50. Finally, it discusses ideas for how this research could be extended and continued. \par This research was performed as part of Dan Ellard's qualifying examination. %X tr-12-98.ps.gz %R TR-13-98 %D 1998 %T The ANT Architecture -- An Architecture for CS1 %A Daniel J. Ellard %A Penelope A.
Ellard %A James M. Megquier %A J. Bradley Chen %A Margo I. Seltzer %X An overview of the pedagogical and philosophical motivations for teaching machine architecture in CS1, and a description of the ANT architecture, which was specifically designed for use in our introductory computer science course. %X tr-13-98.ps.gz %R TR-14-98 %D 1998 %T The ANT Architecture -- An Architecture For CS1 %A Daniel J. Ellard %A Penelope A. Ellard %A James M. Megquier %A J. Bradley Chen %X A description of the ANT architecture, how we use ANT in our CS1 curriculum, and the pedagogical issues that motivated the creation and design of the ANT architecture. %X tr-14-98.ps.gz %R TR-15-98 %D 1998 %T Data Representation and Assembly Language Programming: The ANT-97 Architecture %A Daniel J. Ellard %A Penelope A. Ellard %X A tutorial for data representation (principally two's complement notation and arithmetic and ASCII codes) and assembly language programming using the ANT architecture. This tutorial has been used in Harvard's CS1 course (1996-1998). %X tr-15-98.ps.gz %R TR-16-98 %D 1998 %T The Mug-Shot Search Problem: A Study of the Eigenface Metric, Search Strategies, and Interfaces in a System for Searching Facial Image Data %A Ellen Jill Baker %X This thesis presents an investigation of methods for conducting an efficient look-up in a pictorial "phonebook" (i.e., a facial image database). Although research on efficient "mug-shot search" is under way, little has yet been done to evaluate the effectiveness of various proposed techniques, and much work remains before systems as practical or ubiquitous as phonebooks are attainable. The thesis describes a prototype system based on the idea of combining a composite face creation method with a face-recognition technique, so that a user may create a facial image and then automatically locate other similar-looking faces in the database.
Several methods for evaluating such a system are presented as well as the results and analysis of a user-study employing the methods. \par Three basic system components are considered and evaluated: the metric for determining which faces are most similar in appearance to a given "query" face, the interface for producing the query face, and the search strategy. The data demonstrate that the Eigenface metric is a useful (though imperfect) model of human perception of similarity between faces. The data also show how the lack of agreement among people about which faces are most similar to a query limits what can be reasonably expected from any metric. Via simulation, it is demonstrated that, if indeed there were a single human metric for assessing facial similarity, and if the Eigenface metric correlated perfectly with this human metric, then simple interactive hill-climbing in the space of the database images would be an excellent search strategy, capable of reducing the average number of image inspections required in a search to about 2% of the database. But this superiority of hill-climbing in principle is not sustained in practice, given the observed level of correlation between the Eigenface similarity metric and the "human" one. The average number of image inspections required for the hill-climbing strategy was, in fact, closer to 35% of the database. While this represents an improvement over the 50% required on average for a simple sequential search of the data, it is still insufficient for practical use. However, given the actual performance of the Eigenface metric, the study data show that a non-iterative strategy of constructing a single query image that is a composite of selected features from 100 random database faces is a better approach, reducing the average number of image inspections to about 20% of the database. 
These and other examples demonstrate and quantify the benefits of an interface in which the Eigenface metric is combined with a composite creation system. %X tr-16-98.ps.gz %R TR-01-99 %D 1999 %T Silhouette Mapping %A Xianfeng Gu %A Steven Gortler %A Hugues Hoppe %A Leonard McMillan %A Benedict J. Brown %A Abraham D. Stone %X Recent image-based rendering techniques have shown success in approximating detailed models using sampled images over coarser meshes. One limitation of these techniques is that the coarseness of the geometric mesh is apparent in the rough polygonal silhouette of the rendering. In this paper, we present a scheme for accurately capturing the external silhouette of a model in order to clip the approximate geometry. \par Given a detailed model, silhouettes sampled from a discrete set of viewpoints about the object are collected into a silhouette map. The silhouette from an arbitrary viewpoint is then computed as the interpolation from three nearby viewpoints in the silhouette map. Pairwise silhouette interpolation is based on a visual hull approximation in the epipolar plane. The silhouette map itself is adaptively simplified by removing views whose silhouettes are accurately predicted by interpolation of their neighbors. The model geometry is approximated by a progressive hull construction, and is rendered using projected texture maps. The 3D rendering is clipped to the interpolated silhouette using stencil planes. %X tr-01-99.ps.gz %R TR-02-99 %D 1999 %T Speculative Pruning for Boolean Satisfiability %A Wheeler Ruml %A Adam Ginsburg %A Stuart Shieber %X Much recent work on boolean satisfiability has focused on incomplete algorithms that sacrifice accuracy for improved running time. Statistical predictors of satisfiability do not return actual satisfying assignments, but at least two have been developed that run in linear time. Search algorithms allow increased accuracy with additional running time, and can return satisfying assignments.
The efficient search algorithms that have been proposed are based on iteratively improving a random assignment, in effect searching a graph of degree equal to the number of variables. In this paper, we examine an incomplete algorithm based on searching a standard binary tree, in which statistical predictors are used to speculatively prune the tree in constant time. Experimental evaluation on hard random instances shows it to be the first practical incomplete algorithm based on tree search, surpassing even graph-based methods on smaller instances. %X tr-02-99.ps.gz %R TR-03-99 %D 1999 %T Self-Monitoring in VINO %A Hany S. Saleeb %X Computer system performance is a measure of how well the operating system shares hardware and software resources among the various applications that are running on it. The goal of performance monitoring in the VINO extensible operating system is to make recommendations for improving application performance. This is accomplished by collecting system data through a monitoring agent, automatically identifying conditions causing performance degradation, and presenting evidence to support its conclusions. Once the operating system is well monitored, it may be able to tune itself to improve system performance. \par Within the framework of VINO, I describe a system that monitors itself and gathers information about its performance. I show that system monitoring is advantageous for two reasons. First, through the use of two user applications, it is capable of warning designers of application bottlenecks and system degradation. Second, the operating system can dynamically self-adapt its own kernel behavior and policies after monitoring access patterns. Thus, the monitoring system aids the application designer in defining performance limitations and adapts kernel policies to improve overall system performance. 
%X tr-03-99.ps.gz %R TR-04-99 %D 1999 %T A Collaborative Approach to Newspaper Layout %A Benjamin Lubin %X tr-04-99.ps.gz %R TR-05-99 %D 1999 %T AnGraf: Creating Custom Animated Data Graphics %A Daniel Dias %X tr-05-99.ps.gz %R TR-06-99 %D 1999 %T Creating Socially Conscious Agents: Decision-Making in the Context of Group Commitments %A Alyssa Glass %X With growing opportunities for individually motivated agents to work collaboratively to satisfy shared goals, it becomes increasingly important to design agents that can make intelligent decisions in the context of commitments to group activities. In particular, agents need to be able to reconcile their intentions to do group-related actions with other, conflicting actions. In this thesis, I present the framework for the SPIRE experimental system that allows the process of intention reconciliation in team contexts to be simulated and studied. I define a measure of social consciousness and show how it can be incorporated into the SPIRE system. Using SPIRE, I then investigate the effect of infinite and limited time horizons on agents with varying levels of social consciousness, as well as the resulting effect on the utility of the group as a whole. Using these experiments as a basis for theoretical conclusions, I suggest preliminary principles for designers of collaborative agents. %X tr-06-99.ps.gz %R TR-07-99 %D 1999 %T Logging versus Soft Updates: Asynchronous Meta-data Protection in File Systems %A Margo Seltzer %A Gregory Ganger %A M. Kirk McKusick %A Keith A. Smith %A Craig Soules %A Christopher Stein %X The UNIX Fast File System (FFS) is probably the most widely-used file system for performance comparisons. However, such comparisons frequently overlook many of the performance enhancements that have been added over the past decade. In this paper, we explore the two most commonly used approaches for improving the performance of meta-data operations and recovery: logging and Soft Updates.
\par The commercial sector has moved en masse to logging file systems, as evidenced by their presence on nearly every server platform available today: Solaris, AIX, Digital UNIX, HP-UX, Irix and Windows NT. On all but Solaris, the default file system uses logging. In the meantime, Soft Updates holds the promise of providing stronger reliability guarantees than logging, with faster recovery and superior performance in certain boundary cases. \par In this paper, we explore the benefits of both Soft Updates and logging, comparing their behavior on both microbenchmarks and workload-based macrobenchmarks. We find that logging alone is not sufficient to "solve" the meta-data update problem. If synchronous semantics are required (i.e., meta-data operations are durable once the system call returns), then the logging systems cannot realize their full potential. Only when this synchronicity requirement is relaxed can logging systems approach the performance of systems like Soft Updates. Our asynchronous logging and Soft Updates systems perform comparably in most cases. While Soft Updates excels in some meta-data intensive microbenchmarks, it outperforms logging on only two of the four workloads we examined and performs less well on one. %X tr-07-99.ps.gz %R TR-08-99 %D 1999 %T The Asymptotics of Selecting the Shortest of Two, Improved %A Michael Mitzenmacher %A Berthold Voecking %X We investigate variations of a novel, recently proposed load balancing scheme based on small amounts of choice. The static (hashing) setting is modeled as a balls-and-bins process. The balls are sequentially placed into bins, with each ball selecting $d$ bins randomly and going to the bin with the fewest balls. A similar dynamic setting is modeled as a scenario where tasks arrive as a Poisson process at a bank of FIFO servers and queue at one for service. Tasks probe a small random sample of servers in the bank and queue at the server with the fewest tasks.
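The static balls-and-bins process just described can be illustrated with a toy simulation (a sketch for intuition only, not the authors' model or analysis; the function name and parameters are invented, and the authors' dynamic queueing setting is not modeled here):

```python
import random

def max_load(n_balls, n_bins, d, seed=0):
    """Place n_balls sequentially; each ball samples d bins uniformly at
    random and goes to the sampled bin currently holding the fewest balls."""
    rng = random.Random(seed)
    bins = [0] * n_bins
    for _ in range(n_balls):
        best = min((rng.randrange(n_bins) for _ in range(d)),
                   key=lambda i: bins[i])
        bins[best] += 1
    return max(bins)

# d=1 is plain random placement; moving to d=2 flattens the maximum load
# dramatically (Theta(log n / log log n) versus Theta(log log n)).
print(max_load(10000, 10000, 1), max_load(10000, 10000, 2))
```

Running the simulation shows the characteristic "power of two choices" gap between d=1 and d=2 that this line of work refines.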
\par Recently it has been shown that breaking ties in a fixed, asymmetric fashion actually improves performance, whereas in all previous analyses, ties were broken randomly. We demonstrate the nature of this improvement using fluid limit models, suggest further improvements, and verify and quantify the improvement through simulations. %X tr-08-99.ps.gz %R TR-09-99 %D 1999 %T Improving Interactive System Performance using TIPME %A Yasuhiro Endo %A Margo Seltzer %X On the vast majority of today's computers, the dominant form of computation is GUI-based user interaction. In such an environment, the user's perception is the final arbiter of performance. Human-factors research shows that a user's perception of performance is affected by unexpectedly long delays. However, most performance-tuning techniques currently rely on throughput-sensitive benchmarks. While these techniques improve the average performance of the system, they do little to detect or eliminate response-time variabilities -- in particular, unexpectedly long delays. \par We introduce a measurement methodology that improves user-perceived performance by helping us to identify and eliminate the causes of the unexpectedly long response times that users find unacceptable. We describe TIPME (The Interactive Performance Monitoring Environment), a collection of measurement tools that implements this methodology, and we present two case studies that demonstrate its effectiveness. Each of the performance problems we identify drastically affects variability in response time in a mature system, demonstrating that current tuning techniques do not address this class of performance problems. %X tr-09-99.ps.gz %R TR-10-99 %D 1999 %T Seed-Growth Heuristics for Graph Bisection %A Wheeler Ruml %A Joseph Marks %A Stuart M. Shieber %A J. Thomas Ngo %X We investigate a family of algorithms for graph bisection that are based on a simple local connectivity heuristic, which we call seed-growth.
We show how the heuristic can be combined with stochastic search procedures and a postprocess application of the Kernighan-Lin algorithm. In a series of time-equated comparisons against large-sample runs of pure Kernighan-Lin, the new algorithms find bisections of the same or superior quality. Their performance is particularly good on structured graphs representing important industrial applications. An appendix provides further favorable comparisons to other published results. Our experimental methodology and extensive empirical results provide a solid foundation for further empirical investigation of graph-bisection algorithms. %X tr-10-99.ps.gz %R TR-11-99 %D 1999 %T Socially Conscious Decision-Making %A Alyssa Glass %A Barbara Grosz %X The growing need for individually motivated agents to work collaboratively to satisfy shared goals has made it increasingly important to design agents that can make intelligent decisions in the context of commitments to group activities. Agents need to reconcile their intentions to do group-related actions with other, conflicting actions. We describe the SPIRE experimental system which allows the process of intention reconciliation in team contexts to be simulated and studied. We define a measure of social consciousness, discuss its incorporation into the SPIRE system, and present several experiments that investigate the interaction in decision-making of measures of group and individual good. In particular, we investigate the effect of infinite and limited time horizons, different task densities, and varying levels of social consciousness on the utility of the group and the individuals it comprises. A key finding is that an intermediate level of social consciousness yields better results than an extreme commitment. We suggest preliminary principles for designers of collaborative agents based on the results. 
%X tr-11-99.ps.gz %R TR-12-99 %D 1999 %T Improving Interactive System Performance using TIPME %A Yasuhiro Endo %X tr-12-99.ps.gz %R TR-13-99 %D 1999 %T A Resource Management Framework for Central Servers %A David G. Sullivan %A Margo I. Seltzer %X Proportional-share resource management is becoming increasingly important in today's computing environments. In particular, the growing use of the computational resources of central service providers argues for a proportional-share approach that allows clients to obtain resource shares that reflect their relative importance. In such environments, clients must be isolated from one another to prevent the activities of one client from impinging on the resource rights of others. However, such isolation limits the flexibility with which resource allocations can be modified to reflect the actual needs of clients. We present extensions to the lottery-scheduling resource-management framework that increase its flexibility while preserving its ability to provide secure isolation. To demonstrate how this extended framework safely overcomes the limits imposed by existing proportional-share schemes, we have implemented a prototype system that uses the framework to manage CPU time, physical memory, and disk bandwidth. We present the results of experiments that evaluate the prototype, and we show that our framework enables clients of central servers to achieve significant improvements in performance. %X tr-13-99.ps.gz %R TR-14-99 %D 1999 %T Operating System Support for Multi-User, Remote, Graphical Interaction %A Alexander Ya-Li Wong %A Margo Seltzer %X The rising popularity of thin client computing and multi-user, remote, graphical interaction brings back to the fore a range of long-dormant operating system research issues and introduces a number of new directions. \par This paper investigates the impact of operating system design on the performance of thin client service.
We contend that the key performance metric for this type of system is user-perceived latency and give a structured approach for investigating operating system design with this criterion in mind. \par In particular, we apply our approach to a quantitative comparison and analysis of Windows NT, Terminal Server Edition (TSE) and Linux with the X Window System, two popular implementations of thin client service. \par We find that the processor and memory scheduling algorithms in both operating systems are not tuned for thin client service. Under heavy CPU and memory load, we observed user-perceived latencies of up to 100 times beyond the threshold of perception, and even in the idle state these systems induce unnecessary latency. TSE performs particularly poorly despite scheduler modifications to improve interactive responsiveness. We also show that TSE's network protocol outperforms X by up to six times, and also makes use of a bitmap cache, which is essential for handling dynamic elements of modern user interfaces and can reduce network load in these cases by up to 2000%. %X tr-14-99.ps.gz %R TR-01-00 %D 2000 %T Selecting Closest Vectors Through Randomization %A Carl Bosley %A Michael O. Rabin %X We consider the problem of finding the closest vectors to a given vector in a large set of vectors, and propose a randomized solution. The method has applications in Automatic Target Recognition (ATR), Web Information Retrieval, and Data Mining. %X tr-01-00.ps.gz %R TR-02-00 %D 2000 %T The Write-Ahead File System: Integrating Kernel and Application Logging %A Chris Stein %X REPORT WITHDRAWN %R TR-03-00 %D 2000 %T Using Multiple Hash Functions to Improve IP Lookups %A Michael Mitzenmacher %A Andrei Broder %X High performance Internet routers require a mechanism for very efficient IP address look-ups. Some techniques used to this end, such as binary search on levels, need to construct quickly a good hash table for the appropriate IP prefixes.
In this paper we describe an approach for obtaining good hash tables based on using multiple hashes of each input key (which is an IP address). The methods we describe are fast, simple, scalable, parallelizable, and flexible. In particular, in instances where the goal is to have one hash bucket fit into a cache line, using multiple hashes proves extremely suitable. We provide a general analysis of this hashing technique and specifically discuss its application to binary search on levels with prefix expansion. %X tr-03-00.ps.gz %R TR-04-00 %D 2000 %T Quantum versus Classical Learnability %A Rocco Servedio %X This paper studies fundamental questions in computational learning theory from a quantum computation perspective. We consider quantum versions of two well-studied classical learning models: Angluin's model of exact learning from membership queries and Valiant's Probably Approximately Correct (PAC) model of learning from random examples. We give positive and negative results for quantum versus classical learnability. For each of the two learning models described above, we show that any concept class is information-theoretically learnable from polynomially many quantum examples if and only if it is information-theoretically learnable from polynomially many classical examples. In contrast to this information-theoretic equivalence between quantum and classical learnability, though, we observe that a separation does exist between {\em efficient} quantum and classical learnability. For both the model of exact learning from membership queries and the PAC model, we show that under a widely held computational hardness assumption for classical computation (the intractability of factoring), there is a concept class which is polynomial-time learnable in the quantum version but not in the classical version of the model.
%X tr-04-00.ps.gz %R TR-05-00 %D 2000 %T Multi-Domain Sandboxing: An Overview %A Robert Fischer %X In today's computing world, computer code is most often developed on one computer and run on another. Code is increasingly downloaded and run on a casual basis, as the line between code and data is blurred and executable code is found in web pages, spreadsheets, word processor documents, etc. \par Not having the knowledge or resources to verify the lack of malicious intent of that code, the user must rely on hearsay and technological solutions to ensure that casually downloaded code does not damage the user's computer or steal data. \par Building on the past concepts of sandboxing and multi-level security, we propose multi-domain sandboxing. This security system allows programs more flexibility than traditional sandboxing, while preventing them from taking malicious actions. We propose applications of this new technology to the web, increasing the functionality and security possible in web applications. %X tr-05-00.ps.gz %R TR-06-00 %D 2000 %T Estimating Resemblance of MIDI Documents %A Michael Mitzenmacher %A Sean Owen %X Search engines often employ techniques for determining syntactic similarity of Web pages. Such a tool allows them to avoid returning multiple copies of essentially the same page when a user makes a query. Here we describe our experience extending these techniques to MIDI music files. The music domain requires modification to cope with problems introduced in the musical setting, such as polyphony. Our experience suggests that when used properly these techniques prove useful for determining duplicates and clustering databases in the musical setting as well. %X tr-06-00.ps.gz %R TR-07-00 %D 2000 %T On the Hardness of Finding Optimal Multiple Preset Dictionaries %A Michael Mitzenmacher %X Preset dictionaries for Huffman codes are used effectively in fax transmission and JPEG encoding.
A natural extension is to allow multiple preset dictionaries instead of just one. We show, however, that finding optimal multiple preset dictionaries for Huffman and LZ77-based compression schemes is NP-hard. %X tr-07-00.ps.gz %R TR-08-00 %D 2000 %T Towards Compressing Web Graphs %A Micah Adler %A Michael Mitzenmacher %X In this paper, we consider the problem of compressing graphs of the link structure of the World Wide Web. We provide efficient algorithms for such compression that are motivated by recently proposed random graph models for describing the Web. The algorithms are based on reducing the compression problem to the problem of finding a minimum spanning tree in a directed graph related to the original link graph. The performance of the algorithms on graphs generated by the random graph models suggests that by taking advantage of the link structure of the Web, one may achieve significantly better compression than natural Huffman-based schemes. We also provide hardness results demonstrating limitations on natural extensions of our approach. %X tr-08-00.ps.gz %R TR-01-01 %D 2001 %T Communication Timestamps for Filesystem Synchronization %A Russ Cox %A William Josephson %X The problem of detecting various kinds of update conflicts in file system synchronization following a network partition is well-known. All systems of which we are aware use the version vectors of Parker et al. These require O(R*F) storage space for F files shared among R replicas. We propose a number of different methods, the most space-efficient of which uses O(R*F) space in the worst case, but O(R+F) in the expected case. \par To gain experience with the various methods, we implemented a file synchronization tool called Tra. Based on this experience, we discuss the advantages and disadvantages of each particular method. \par Tra itself turns out to be useful for a variety of tasks, including home directory maintenance, operating system installation, and managing offline work. 
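The version vectors of Parker et al. mentioned above support a simple pairwise comparison that distinguishes safe updates from true conflicts; a minimal sketch (the dict representation, function name, and replica labels here are invented for illustration):

```python
def compare(v1, v2):
    """Compare two version vectors, each a dict mapping a replica id to
    that replica's update counter for one file."""
    replicas = set(v1) | set(v2)
    ge = all(v1.get(r, 0) >= v2.get(r, 0) for r in replicas)
    le = all(v1.get(r, 0) <= v2.get(r, 0) for r in replicas)
    if ge and le:
        return 'equal'
    if ge:
        return 'v1-dominates'   # v2's copy can safely be overwritten
    if le:
        return 'v2-dominates'   # v1's copy can safely be overwritten
    return 'conflict'           # concurrent updates: must be reconciled

# A file updated independently on replicas A and B during a partition:
print(compare({'A': 2, 'B': 1}, {'A': 1, 'B': 2}))  # prints "conflict"
```

Keeping one such R-entry vector for each of F files is what yields the O(R*F) storage cost the abstract refers to.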
We discuss some of these uses. %X tr-01-01.ps.gz %R TR-02-01 %D 2001 %T Incomplete Tree Search using Adaptive Probing %A Wheeler Ruml %X When not enough time is available to fully explore a search tree, different algorithms will visit different leaves. Depth-first search and depth-bounded discrepancy search, for example, make opposite assumptions about the distribution of good leaves. Unfortunately, it is rarely clear a priori which algorithm will be most appropriate for a particular problem. Rather than fixing strong assumptions in advance, we propose an approach in which an algorithm attempts to adjust to the distribution of leaf costs in the tree while exploring it. By sacrificing completeness, such flexible algorithms can exploit information gathered during the search using only weak assumptions. As an example, we show how a simple depth-based additive cost model of the tree can be learned on-line. Empirical analysis using a generic tree search problem shows that adaptive probing is competitive with systematic algorithms on a variety of hard trees and outperforms them when the node-ordering heuristic makes many mistakes. Results on boolean satisfiability and two different representations of number partitioning confirm these observations. Adaptive probing combines the flexibility and robustness of local search with the ability to take advantage of constructive heuristics. %X tr-02-01.ps.gz %R TR-03-01 %D 2001 %T gNarLI: A Practical Approach to Natural Language Interfaces to Databases %A AJ Shankar %A Wing Yung %X Most attempted natural language interfaces to databases take too general an approach, and so either require large amounts of setup time or do not provide adequate natural language support for a given domain. \par Our approach, gNarLI, is a pragmatic one: it focuses on one small, easily- and well-defined domain at a time (say, an Oscars movie database). Domain definitions consist of simple rule-based actions. 
gNarLI provides a flexible pattern-matching preprocessor, an intuitive join processor, and language-independent pronoun support ("Who won best actor in 1957?" ... "Where was he born?"), in addition to considerable freedom in mapping rules to one or more portions of a SQL statement. \par All aspects of the program are fully customizable on a domain basis. Appropriately sized domains can be constructed from scratch in less than 15 hours. The ease and speed of domain construction (requiring no programming skills) make gNarLI adaptable to different areas with little difficulty. \par For small domains, gNarLI works surprisingly well. As domain size increases, though, its join processor and rule-based approach become less successful. %X tr-03-01.ps.gz %R TR-04-01 %D 2001 %T Assigning Features using Additive Clustering %A Wheeler Ruml %X If the promise of computational modeling is to be fully realized in higher-level cognitive domains such as language processing, principled methods must be developed to construct the semantic representations that serve as these models' input and/or output. In this paper, we propose the use of an established formalism from mathematical psychology, additive clustering, as a means of automatically assigning discrete features to objects using only pairwise similarity data. Similar approaches have not been widely adopted in the past, as existing methods for the unsupervised learning of such models do not scale well to large problems. We propose a new algorithm for additive clustering, based on heuristic combinatorial optimization. Through extensive empirical tests on both human and synthetic data, we find that the new algorithm is more effective than previous methods and that it also scales well to larger problems. By making additive clustering practical, we take a significant step toward scaling connectionist models beyond hand-coded examples.
%X tr-04-01.ps.gz %R TR-05-01 %D 2001 %T An Algebraic Approach to File Synchronization %A Norman Ramsey %A El\H{o}d Csirmaz %X We present a sound and complete proof system for reasoning about operations on filesystems. The proof system enables us to specify a file-synchronization algorithm that can be combined with several different conflict-resolution policies. By contrast, previous work builds the conflict-resolution policy into the specification, or worse, does not specify the behavior formally. We present several alternatives for conflict resolution, and we address the knotty question of timestamps. %X tr-05-01.ps.gz %R TR-06-01 %D 2001 %T Variations on Random Graph Models for the Web %A Eleni Drinea %A Mihaela Enachescu %A Michael Mitzenmacher %X In this paper, we introduce variations on random graph models for Web-like graphs. We provide new ways of interpreting previous models, introduce non-linear models that extend previous work, and suggest models based on feedback between users and search engines. %X tr-06-01.ps.gz %R TR-07-01 %D 2001 %T Dynamic Models for File Sizes and Double Pareto Distributions %A Michael Mitzenmacher %X In this paper, we introduce and analyze a new generative user model to explain the behavior of file size distributions. Our Recursive Forest File model combines ideas from recent work by Downey with ideas from recent work on random graph models for the Web. Unlike similar previous work, our Recursive Forest File model allows new files to be created and old files to be deleted over time, and our analysis covers problematic issues such as correlation among file sizes. Moreover, our model allows natural variations where files that are copied or modified are more likely to be copied or modified subsequently. \par Previous empirical work suggests that file sizes tend to have a lognormal body but a Pareto tail. 
The Recursive Forest File model explains this behavior, yielding a double Pareto distribution, which has a Pareto tail but a body close to lognormal. We believe the Recursive Forest model may be useful for describing other power law phenomena in computer systems as well as other fields. %X tr-07-01.ps.gz %R TR-08-01 %D 2001 %T A Brief History of Generative Models for Power Law and Lognormal Distributions %A Michael Mitzenmacher %X Power law distributions are an increasingly common model for computer science applications; for example, they have been used to describe file size distributions and in- and out-degree distributions for the Web and Internet graphs. Recently, the similar lognormal distribution has also been suggested as an appropriate alternative model for file size distributions. In this paper, we briefly survey some of the history of these distributions, focusing on work in other fields. We find that several recently proposed models have antecedents in work from decades ago. We also find that lognormal and power law distributions connect quite naturally, and hence it is not surprising that lognormal distributions arise as a possible alternative to power law distributions. %X tr-08-01.ps.gz %R TR-09-01 %D 2001 %T %A Steven Gortler %X %R TR-10-01 %D 2001 %T Integrating On-Demand Alias Analysis into Schedulers for Advanced Microprocessors %A Robert Costa %X %R TR-11-01 %D 2001 %T Progressive Profiling: A Methodology Based on Profile Propagation and Selective Profile Collection %A Zheng Wang %X In recent years, Profile-Based Optimization (PBO) has become a key technique in program optimization. In PBO, the optimizer uses information gathered during previous program executions to guide the optimization process. Even though PBO has been implemented in many research systems and some software companies, there has been little research on how to make PBO effective in practice. 
\par In today's software industry, one major hurdle in applying PBO is the conflict between the need for high-quality profiles and the lack of time for long profiling runs. For PBO to be effective, the profile needs to be representative of how users, or a particular user, run the program. For many modern applications that are large and interactive, it takes a significant amount of time to collect high-quality profiles. This problem will only become more prominent as application programs grow more complex. A lengthy profiling process is especially impractical in software production environments, where programs are modified and rebuilt almost daily. Without enough time for extensive profiling runs, the benefit from applying PBO is severely limited. This in turn dampens interest in applying PBO and increases the dependence on hand tuning in software development and testing. \par In order to obtain high-quality profiles in a software production environment without lengthening the daily build cycle, we seek to change the current practice in which a new profile must be generated from scratch for each new program version. Most of today's profiles are generated for a specific program version and become obsolete once the program changes. We propose progressive profiling, a new profiling methodology that propagates a profile across program changes and reuses it on the new version. We use static analysis to generate a mapping between two versions of a binary program, then use the mapping to convert an existing profile for the old version so that it applies to the new version. When necessary, additional profile information is collected for part of the new version to augment the propagated profile. Since the additional profile collection is selective, we avoid the high expense of re-generating the entire profile. 
With progressive profiling, we can collect profile information from different generations of a program and build a high-quality profile through accumulation over time, despite frequent revisions in a software production environment. \par We present two different algorithms for matching binary programs for the purpose of profile propagation, and use common application programs to evaluate their effectiveness. We use a set of quantitative metrics to compare propagated profiles with profiles collected directly on the new versions. Our results show that for program builds that are weeks or even months apart, profile propagation can produce profiles that closely resemble directly collected profiles. To understand the potential for time saving, we implement a prototype system for progressive profiling and investigate a number of different system models. We use a case study to demonstrate that by performing progressive profiling over multiple generations of a program, we can save a significant amount of profiling time while sacrificing little profile quality. %X tr-11-01.ps.gz %R TR-12-01 %D 2001 %T Automated translation: generating a code generator %A Lee D. Feigenbaum %X A key problem in retargeting a compiler is to map the compiler's intermediate representation to the target machine's instruction set. \par One method to write such a mapping is to use grammar-like rules to relate a tree-based intermediate representation with an instruction set. A dynamic-programming algorithm finds the least costly instructions to cover a given tree. Work in this family includes Burg, BEG, and twig. The other method, utilized by gcc and VPO, uses a hand-written ``code expander'' which expands intermediate representation into naive code. The naive code is improved via machine-independent optimizations while maintaining it as a sequence of machine instructions. 
Because they are inextricably linked to a compiler's intermediate representation, neither of these mappings can be reused for anything other than retargeting one specific compiler. \par Lambda-RTL is a language for specifying the semantics of an instruction set independent of any particular intermediate representation. We analyze the properties of a machine from its Lambda-RTL description, then automatically derive the necessary mapping to a target architecture. By separating such analysis from compilers' intermediate representations, Lambda-RTL in conjunction with our work allows a single machine description to be used to build multiple compilers, along with other tools such as debuggers or emulators. \par Our analysis categorizes a machine's storage locations as special registers, general-purpose registers, or memory. We construct a data-movement graph by determining the most efficient way to move arbitrary values between locations. We use this information at compile time to determine which temporary locations to use for intermediate results of large computations. \par To derive a mapping from an intermediate representation to a target machine, we first assume a compiler-dependent translation from the intermediate representation to register-transfer lists. We discover at compile-compile time how to translate these register-transfer lists to machine code and also which register-transfer lists we can translate. To do this, we observe that values are either constants, fetched from locations, or the results of applying operators to values. Our data-movement graph covers constants and fetched values, while operators require an appropriate instruction to perform the effect of the operator. We search through an instruction set discovering instructions to implement operators via the use of algebraic identities, inverses, and rewrite laws and the introduction of unwanted side effects. 
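The data-movement graph in the TR-12-01 abstract above pairs storage locations with the cheapest instruction sequences that move a value between them; a shortest-path search captures that idea. The sketch below is illustrative only, not the report's implementation: the three-location machine and the instruction names (load, store, mtspr, mfspr) are hypothetical.

```python
import heapq

def cheapest_moves(edges, src):
    # edges: {loc: [(dst, cost, insn), ...]} -- a data-movement graph in
    # which each edge is one instruction moving a value between locations.
    # Dijkstra's algorithm yields, for every reachable location, the
    # cheapest cost and the instruction sequence realizing it.
    dist = {src: (0, [])}
    pq = [(0, src, [])]
    seen = set()
    while pq:
        d, loc, path = heapq.heappop(pq)
        if loc in seen:
            continue
        seen.add(loc)
        dist[loc] = (d, path)
        for dst, cost, insn in edges.get(loc, []):
            if dst not in seen:
                heapq.heappush(pq, (d + cost, dst, path + [insn]))
    return dist

# Hypothetical machine: moving a value from memory into a special
# register must be staged through a general-purpose register.
graph = {
    "mem": [("gpr", 3, "load")],
    "gpr": [("mem", 3, "store"), ("spr", 1, "mtspr")],
    "spr": [("gpr", 1, "mfspr")],
}
```

A compiler built this way would consult `cheapest_moves(graph, "mem")` once, at compile-compile time, and emit the stored instruction sequences whenever a value must be staged between location classes.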
%X tr-12-01.ps.gz %R TR-13-01 %D 2001 %T Instruction-Stream Compression %A Christian James Carrillo %X This thesis presents formal elements of instruction-stream compression. \par We introduce notions of instruction representations, compressors and the general ``patternization'' function for representations to sequences. We further introduce the Lua-ISC language, an implementation of these elements. Instruction-stream compression algorithms are expressed, independently of the target architecture, in Lua-ISC. The language itself handles instruction decoding and encoding, patternization and compression; programs within it are compact and readable. \par We perform experiments in instruction representation using Lua-ISC. Our results indicate that the choice of representation and patternization method affect compressor performance, and suggest that current design methodologies may overlook opportunities in lower-level representations. \par Finally, we discuss four instruction-stream compression algorithms and their expressions in Lua-ISC, two of which are our own. The first exploits inter-program redundancy due to static compilation; the second allows state-based compression techniques to function in a random-access environment by compressing instructions as sets of blocks. %X tr-13-01.ps.gz %R TR-14-01 %D 2001 %T CacheDAFS: User Level Client-Side Caching for the Direct Access File System %A Salimah Addetia %X This thesis focuses on the design, implementation and evaluation of user-level client-side caching for the Direct Access File System (DAFS). DAFS is a high performance file access protocol designed for local file sharing in high-speed, low latency networked environments. \par DAFS operates over memory-to-memory interconnects such as Virtual Interface (VI). VI provides a standard for efficient network communication by moving software overheads into hardware and eliminating the operating system from common data transfers. 
While much work has been done on message passing and distributed shared memory in VI-like environments, DAFS is one of the first attempts to extend user-level networking to network file systems. In the environment of high-speed networks with virtual interfaces, software overheads such as data copies and translation, buffer management and context switches become important bottlenecks. The DAFS protocol departs from traditional network file system practices to enhance performance. \par Distributed file systems use client-side caching to improve performance by reducing network traffic, disk traffic and server load. The DAFS client omits any caching. This thesis presents a user-space cache for DAFS called cacheDAFS with a careful design that avoids most bottlenecks in network file system protocols and user-level networking environments. CacheDAFS maintains perfect consistency among DAFS clients using NFSv4-like open delegations. Changes to the DAFS API in order to add caching are minimal and results show that DAFS applications can use cacheDAFS to reap all the standard benefits of caching. %X tr-14-01.ps.gz %R TR-01-02 %D 2002 %T Heuristic Search in Bounded-depth Trees: Best-Leaf-First Search %A Wheeler Ruml %X Many combinatorial optimization and constraint satisfaction problems can be formulated as a search for the best leaf in a tree of bounded depth. When exhaustive enumeration is infeasible, a rational strategy visits leaves in increasing order of predicted cost. Previous systematic algorithms for this setting follow a predetermined search order, making strong implicit assumptions about predicted cost and using problem-specific information inefficiently. We introduce a framework, best-leaf-first search (BLFS), that employs an explicit model of leaf cost. BLFS is complete and visits leaves in an order that efficiently approximates increasing predicted cost. Different algorithms can be derived by incorporating different sources of information into the cost model. 
We show how previous algorithms are special cases of BLFS. We also demonstrate how BLFS can derive a problem-specific model during the search itself. Empirical results on Latin square completion, binary CSPs, and number partitioning problems suggest that, even with simple cost models, BLFS yields competitive or superior performance and is more robust than previous methods. BLFS can be seen as a model-based extension of iterative-deepening A*, and thus it unifies search for combinatorial optimization and constraint satisfaction with traditional AI heuristic search for shortest-path problems. %X tr-01-02.ps.gz %R TR-02-02 %D 2002 %T Bounds and Improvements for BiBa Signature Schemes %A Michael Mitzenmacher %A Adrian Perrig %X This paper analyzes and improves the recently proposed bins and balls (BiBa) signature, a new approach for designing signatures from one-way functions without trapdoors. \par We first construct a general framework for signature schemes based on the balls and bins paradigm and propose several new related signature algorithms. The framework also allows us to obtain upper bounds on the security of such signatures. Several of our signature algorithms approach the upper bound. We then show that by changing the framework in a novel manner we can boost the efficiency and security of our signature schemes. We call the resulting mechanism Powerball signatures. Powerball signatures offer greater security and efficiency than previous signature schemes based on one-way functions without trapdoors. %X tr-02-02.ps.gz %R TR-03-02 %D 2002 %T Scaling Filename Queries in a Large-Scale Distributed File System %A Jonathan Ledlie %A Laura Serban %A Dafina Toncheva %X We have examined the tradeoffs in applying regular and Compressed Bloom filters to the name query problem in distributed file systems and developed and tested a novel mechanism for scaling queries as the network grows large. 
Filters greatly reduced query messages when using Fan's "Summary Cache" in web cache hierarchies, a similar, albeit smaller, searching problem. We have implemented a testbed that models a distributed file system and run experiments that test various configurations of the system to see if Bloom filters could provide the same kind of improvements. In a realistic system, where the chance that a randomly queried node holds the file being searched for is low, we show that filters always provide lower bandwidth/search and faster time/search, as long as the rate of change of the files stored at the nodes is not extremely high relative to the number of searches. In other words, we confirm the intuition that keeping some state about the contents of the rest of the system will aid in searching as long as acquiring this state is not overly costly and it does not expire too quickly. \par The grouping topology we have developed divides n nodes into log(n) groups, each of which has a representative node that aggregates a composite filter for the group. All nodes not in that group use this low-precision filter to weed out whole collections of nodes by probing these filters, only sending a search to be proxied by a member of the group if the probe of the group filter returns positively. Proxied searches are then carried out within a group, where more precise (more bits per file) filters are kept and exchanged between the n/log(n) nodes in a group. Experimental results show that both bandwidth/search and time/search are improved with this novel grouping topology. %X tr-03-02.ps.gz %R TR-04-02 %D 2002 %T %R TR-01-87 %D 1987 %T A Very Simple Construction for Atomic Multiwriter Register %A Ming Li %A Paul M.B. Vitanyi %X This paper introduces a new and conceptually very simple algorithm to implement an atomic {\it n}-reader {\it n}-writer variable directly from atomic 1-reader 1-writer variables, using bounded tags. 
The algorithm is developed top-down from the unbounded tag method in [VA]. This is the first direct such construction, and considerably improves the complexity of all known compound constructions. The algorithm uses new techniques, but its main virtue is that it is {\it conceptually very simple and easily proved correct.} %R TR-02-87 %D 1987 %T Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance. %A Michael O. Rabin %X We develop an Information Dispersal Algorithm (IDA) which breaks a file $F$ of length $L = |F|$ into $n$ pieces $F_i$, $1 \leq i \leq n$, each of length $|F_i| = L/m$, so that every $m$ pieces suffice for reconstructing $F$. Dispersal and reconstruction are computationally efficient. The sum of the lengths $|F_i|$ is $(n/m) \cdot L$. Since $n/m$ can be chosen to be close to $1$, the IDA is space efficient. IDA has numerous applications to secure and reliable storage of information in computer networks and even on single disks, to fault-tolerant and efficient transmission of information in networks, and to communications between processors in parallel computers. For the latter problem we get provably time efficient and highly fault-tolerant routing on the $n$-cube, using just constant size buffers. %R TR-03-87 %D 1987 %T Learning in the Presence of Malicious Errors. %A Michael Kearns %A Ming Li %X We study an extension to the Valiant model of machine learning from examples in which errors may be present in the sample data. We give strong bounds on the rate of error tolerable when the errors are of a ``malicious'' nature, and show that it is crucial to take both positive and negative examples when such errors exist. Quantitative comparisons between the malicious model of errors and a model of uniform random noise introduced by Angluin and Laird are made. 
A greedy heuristic for a new generalization of set covering that is of independent interest is given, with nontrivial applications to learning with errors. We also give a reduction from learning to set cover. %R TR-04-87 %D 1987 %T Two Phase Gossip:\\ Managing Distributed Event Histories. %A Abdelsalam Heddaya %A Meichun Hsu %A William Weihl %X We describe a distributed protocol that operates on replicated data objects---of arbitrary abstract types---represented in terms of the history of object events, rather than in terms of object values. In general, a site that stores a replica of an object has only partial knowledge of the object's event history. We call such a replica an {\it object representative} . The goal of the protocol is to limit the sizes of the histories by checkpointing and discarding old events. We treat separately the three functions of (1) propagating events to sites that do not know about them, (2) checkpointing, and (3) discarding old events. A site rolls forward its checkpoint state as far as it {\it knows} its local version of the object's history to be complete, but it discards old events only as far as it knows that all other sites {\it know} their local histories to be complete. Our protocol propagates events among sites in {\it gossip} messages exchanged in the background in a {\it two phase} manner. Each site maintains several timestamp vectors indicating the extent of its knowledge of other sites' knowledge of events. These timestamp vectors are included in gossip messages and used to decide the extent of global-completeness of each site's local version of the event history. We formally define the correctness criteria for our protocol in terms of the completeness properties of histories. %R TR-05-87 %D 1987 %T An Integrated Toolkit for Operating System Security. %A Michael O. Rabin %A J.D. Tygar %R TR-06-87 %D 1987 %T On the influence of single participant in coin flipping schemes. 
%A Benny Chor %A Mih\'{a}ly Ger\'{e}b-Graus %X We prove that in a one round fair coin flipping scheme with {\it n} participants, either the {\it average} influence of all participants is at least $3/n - o(1/n)$, or there is at least one participant whose influence is $\Omega\left(n^{-5/6}\right)$. %R TR-07-87 %D 1987 %T The Complexity of Parallel Comparison Merging. %A Mih\'{a}ly Ger\'{e}b-Graus %A Danny Krizanc %X We prove a worst case lower bound of $\Omega(\log \log n)$ for randomized algorithms merging two sorted lists of length $n$ in parallel using $n$ processors on Valiant's parallel computation tree model. We show how to strengthen this result to a lower bound for the expected time taken by any algorithm on the uniform distribution. Finally, bounds are given for the average time required for the problem when the number of processors is less than and greater than $n$. %R TR-08-87 %D 1987 %T The Approximation of the Permanent is Still Open\\ - A Flaw in Broder's Proof - %A Milena Mihail %X In [B] Broder claims to have obtained a polynomial time randomized algorithm for approximating the permanent of dense 0-1 matrices. In this note we point out a major flaw in the analysis of the algorithm: the process C2 defined in [B] as a coupling is in fact {\it not} such. Thus, it remains open whether Broder's algorithm gives satisfactory estimates. %R TR-09-87 %D 1987 %T $k+1$ Heads are Better Than $k$ for PDA's %A Marek Chrobak %A Ming Li %X We prove the following conjecture stated by Harrison and Ibarra in 1968 [HI, p.462]: There are languages accepted by $(k+1)$-head 1-way deterministic pushdown automata ($(k+1)$-DPDA) but not by $k$-head 1-way pushdown automata ($k$-PDA), for every $k$. (Partial solutions for this conjecture can be found in [M1,M2,C].) On the assumption that their conjecture holds, [HI] also derived some other consequences. Now all those consequences become theorems. 
For example, the class of languages accepted by $k$-PDA's is not closed under intersection and complementation. Several other interesting consequences also follow: CFL $\not\subseteq \cup_k$ DPDA($k$) and FA(2) $\not\subseteq \cup_k$ DPDA($k$), where DPDA($k$) $= \{L \mid L$ is accepted by a $k$-DPDA$\}$ and FA(2) $= \{L \mid L$ is accepted by a 2-head FA$\}$. Our proof is constructive (that is, not based on diagonalization). Before, the ``$k$+1 versus $k$ heads'' problem was solved by diagonalization and translation methods [I2,M2,M3,M4,S] for stronger machines (2-way, etc.), and by traditional counting arguments [S2,IK,YR,M1] for weaker machines ($k$-FA, $k$-head counter machines, etc.). %R TR-10-87 %D 1987 %T Tape versus Queue and Stacks:\\ The Lower Bounds %A Ming Li %A Paul M.B. Vit\'{a}nyi %X Several new optimal or nearly optimal lower bounds are derived on the time needed to simulate queues, stacks (stack = pushdown store) and tapes by one off-line single-head tape-unit with one-way input, for both the deterministic case and the nondeterministic case. The techniques rely on algorithmic information theory (Kolmogorov complexity). %R TR-11-87 %D 1987 %T Plans for Discourse %A Barbara J. Grosz %A Candace L. Sidner %R TR-12-87 %D 1987 %T Oblivious Secret Computation %A Donald Beaver %X Recently, methods for secret, distributed, and fault-tolerant computation have been developed. Those techniques show how to ``compile'' a function to produce a protocol in which a distributed system evaluates that function at secret arguments, revealing the value of the function but not the values of the arguments, nor the values in intermediate steps of the computation. We extend these methods in two ways: first, we show that the {\it function} itself need not be revealed. That is, we show how to evaluate a function secretly, without revealing any information about the function itself, other than a bound on the time and space needed to compute it. 
This result has fundamental implications for distributed system security and fault-tolerant computing. Secondly, we show that revealing the {\it result} of a secret computation is not necessary, and we give useful applications in which the values of secretly computed functions are never revealed. One such application extends the ideas of oblivious transfer to the distributed environment. Using this extension, we solve classical, two-party oblivious transfer {\it without using cryptography}, unlike all previous solutions. %R TR-13-87 %D 1987 %T Polynomially Sized Boolean Circuits Are\\ Not Learnable %A Donald Beaver %X Polynomial-sized boolean circuits provide a powerful and general framework to describe many naturally occurring concepts. Because of the wide range of functions which can be described or computed by boolean circuits, an algorithm to learn arbitrary polynomial-sized circuits would have broad implications for artificial intelligence and learning theory. We prove, however, a conjecture made by Valiant that, if one-way functions exist, then the class of polynomial-sized boolean circuits is not probably approximately learnable. We also discuss how this result might be strengthened by eliminating the assumption of one-way functions. %R TR-14-87 %D 1987 %T Oblivious Routing with Limited Buffer Capacity %A Danny Krizanc %X The problem of oblivious routing in fixed connection networks with a limited amount of space available to buffer packets is studied. We show that for an $n$-processor network with a constant number of connections and a constant number of buffers, any deterministic pure source-oblivious strategy realizing all partial permutations requires $\Omega(n)$ time. The consequence of this result for well-known networks is discussed. %R TR-01-88 %D 1988 %T Bounded Time-Stamps %A Amos Israeli %A Ming Li %X Time-stamps are labels which a system adds to its data items. 
These labels enable the system to keep track of the temporal precedence relation among its data elements. Traditionally, time-stamps are used as unbounded numbers and inevitable overflows cause a loss of this precedence relation. In this paper we develop a theory of {\it bounded time-stamps}. Time-stamp systems are defined and the complexity of their implementation is fully analyzed. This theory gives a general tool for converting time-stamp based protocols to bounded protocols. The power of this theory is demonstrated by a novel, conceptually simple protocol for a multi-writer atomic register, as well as by proving, for the first time, a non-trivial lower bound for such a register. %R TR-02-88 %D 1988 %T Two Decades of Applied Kolmogorov Complexity %A Ming Li %A Paul M.B. Vitanyi %X This exposition is a survey of elegant and useful applications of Kolmogorov complexity. We distinguish three areas: I) Application of the fact that some strings are compressible. This includes a strong version of G\"{o}del's incompleteness theorem. II) Lower bound arguments which rest on application of the fact that certain strings cannot be compressed at all. Applications range from Turing machines to electronic chips. III) Other issues. For instance, the foundations of Probability Theory, a priori probability, and resource-bounded Kolmogorov complexity. Applications range from NP-completeness to inductive inference in Artificial Intelligence. %R TR-03-88 %D 1988 %T Hybrid Beam-Ray Tracing %A Joe Marks %A Robert J. Walsh %A Mark Friedell %X Ray tracing is the most accurate, but unfortunately also the most expensive, of current rendering techniques. Beam tracing, suggested by Heckbert and Hanrahan, is a generalization of ray tracing that exploits area coherence to trace multiple rays in parallel. We present a new approach to beam tracing derived from the area-subdivision technique of Warnock's hidden-surface algorithm. 
We use this approach to beam tracing as the basis of a hybrid beam-ray tracing algorithm. The algorithm uses beam tracing to render large coherent regions of the image, and ray tracing to render complex regions. A heuristic decision procedure chooses between beam tracing and ray tracing a given area on the basis of estimated rendering costs. Experimental results show that our hybrid algorithm is more efficient than either ray tracing or beam tracing used alone. %R TR-04-88 %D 1988 %T Unsafe Operations in B-trees %A Bin Zhang %A Meichun Hsu %X A simple mathematical model for analyzing the dynamics of a B-tree node is presented. From the solution of the model, it is shown that the simple technique of allowing a B-tree node to be slightly less than half full can significantly reduce the rate of split, merge and borrow operations. We call split, merge, borrow and balance operations {\it unsafe} operations in this paper. In a multi-user environment, a lower unsafe operation rate implies less blocking and higher throughput, even when tailored concurrency control algorithms (e.g., that proposed by [Lehman \& Yao]) are used. A lower unsafe operation rate also means a longer lifetime for an optimally initialized B-tree (e.g., a compact B-tree). It is in general useful to have an analytical model which can predict the rate of unsafe operations in a dynamic data structure, not only for comparing the behavior of variations of B-trees, but also for characterizing workload for performance evaluation of different concurrency control algorithms for such data structures. The model presented in this paper represents a starting point in this direction. 
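The TR-04-88 abstract above analyzes a single B-tree node's occupancy dynamics. A crude Monte-Carlo sketch of that idea, not the paper's analytical model, treats the node's key count as a random walk and counts the "unsafe" boundary events; all parameters below are hypothetical.

```python
import random

def unsafe_rate(capacity, min_fill, ops=200_000, p_insert=0.5, seed=42):
    # Random-walk sketch of one B-tree node's occupancy.  Each operation
    # inserts a key (probability p_insert) or deletes one.  Overflowing
    # past `capacity` forces a split (the node drops back to half full);
    # falling below `min_fill` forces a merge or borrow (occupancy is
    # restored to min_fill).  Returns the fraction of operations that
    # trigger these "unsafe" structural changes.
    rng = random.Random(seed)
    keys = capacity // 2
    unsafe = 0
    for _ in range(ops):
        if rng.random() < p_insert:
            keys += 1
            if keys > capacity:          # overflow: split
                keys = capacity // 2
                unsafe += 1
        else:
            keys -= 1
            if keys < min_fill:          # underflow: merge or borrow
                keys = min_fill
                unsafe += 1
    return unsafe / ops
```

Comparing `unsafe_rate(100, 50)` with `unsafe_rate(100, 45)` contrasts the classical half-full rule with a slightly relaxed minimum; in this toy model the relaxed threshold tends to produce a lower unsafe-operation rate, qualitatively matching the report's conclusion.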
%R TR-05-88 %D 1988 %T The Mean Value Approach to Performance Evaluation of Cautious Waiting %A Meichun Hsu %A Bin Zhang %X We propose a deadlock-free locking-based concurrency control algorithm, called {\it cautious waiting}, which allows for a limited form of waiting, and present an analytical solution to its performance evaluation based on the mean-value approach. The proposed algorithm is simple to implement. From the modeling point of view, we are able to track both the restart rate and the blocking rate properly, and we show that to solve the model we only need to find the root of a polynomial. From the performance point of view, the analytical tools developed enable us to see that the cautious waiting algorithm manages to achieve a {\it delicate balance} between restart and blocking, and is therefore able to perform better (i.e., has higher throughput) than both the no waiting and the general waiting algorithms. This result is encouraging for other deadlock avoidance-oriented locking algorithms. %R TR-06-88 %D 1988 %T An Architecture-Independent Model for Parallel Programming %A Gary Wayne Sabot %X This dissertation describes a fundamentally new way of looking at parallel programming called the paralation model. The paralation model consists of a new data structure and a small, irreducible set of operators. The model can be combined with any base language to produce a concrete parallel language. One important goal of the model is ease of use for general problem solving. The model must provide tools that address the problems of the programmer. Equally important, languages based upon the model must be easy to compile, in a transparent and efficient manner, for a broad range of target computer architectures (for example, Multiple Instruction Multiple Data (MIMD) or Single Instruction Multiple Data (SIMD) processors; bus-based, butterfly, or grid interconnect; and so on). 
Several compilers based on this model exist, including one that produces code for the 65,536-processor Connection Machine. The dissertation includes a short (two pages long) operational semantics for a paralation language based on Common Lisp. By executing this code, the interested reader can experiment with the paralation constructs. The dissertation (as well as a disk containing a complete implementation of Paralation Lisp) is available from MIT Press as "The Paralation Model: Architecture-Independent Parallel Programming". %R TR-07-88 %D 1988 %T Managing Databases in Distributed Virtual Memory %A Meichun Hsu %A Va-On Tam %R TR-08-88 %D 1988 %T Modeling Performance Impact of Hot Spots %A Meichun Hsu %A Bin Zhang %X An important factor that may affect the performance of the concurrency control algorithm of a database system is the skewness of the distribution of accesses to data granules (i.e., a non-uniform access pattern). In this paper, we examine the impact of hot spots analytically by employing the mean-value approach to performance modeling of concurrency control algorithms. The impact of non-uniform accesses is analyzed based on a new principle of {\it data flow balance}. Using data flow balance we generalize the b-c access pattern (i.e., b\% of accesses directed to c\% of data granules) to arbitrary distributions, and solve the non-uniform access model for two classes of two-phase locking algorithms. We also show that the database reduction factor employed previously in the literature to handle non-uniform access is in fact an upper bound. %R TR-09-88 %D 1988 %T Analytical Preprocessing for Likelihood Ratio Methods %A Bin Zhang %X An analytical preprocessing (APP) method for the likelihood ratio method, an algorithm for finding the derivatives of a performance function of Discrete Event Dynamical Systems (DEDS), is developed.
The convergence of the integral used in the likelihood ratio method and the legality of interchanging $\frac {\partial} {\partial\mu} $ with $\int$ are proved for a large class of sample performance functions. %R TR-10-88 %D 1988 %T Exemplar-based Learning: Theory and Implementation %A Steven Salzberg %X Exemplar-based learning is a theory of inductive learning in which learning is accomplished by storing objects in Euclidean $n$-space, $E^n$, as hyper-rectangles. In contrast to the usual inductive learning paradigm, which learns by replacing symbolic formulae by more general formulae, the generalization process for exemplar-based learning modifies hyper-rectangles by growing and reshaping them in a well-defined fashion. Some advantages and disadvantages of the theory are described, and the theory is compared to other inductive learning theories. An implementation has been tested on several different domains, four of which are presented here: predicting the recurrence of breast cancer, classifying iris flowers, predicting stock prices, and predicting survival times for heart attack patients. The robust learning behavior and easily understandable representations produced by the implementation support the claim that exemplar-based learning offers advantages over other learning models for some classes of problems. %R TR-11-88 %D 1988 %T An Algorithm for Self Diagnosis in Distributed Systems %A Azer A. Bestavros %X In this paper, based on a powerful diagnostic model, we present a distributed diagnosis algorithm that makes a distributed system capable of self-repairing (replacing) faulty units while operating in a gracefully degraded fault-tolerant manner. The algorithm allows the on-line reentry of repaired (replaced) units, and deals effectively with synchronization and time stamping problems. We define an active set to be a set of fault-free processing elements that agree on a unified diagnosis.
The algorithm we propose enables each healthy processor to reliably identify the largest possible active set to which it belongs. If each processor refrains from further dealings with units not in its active set, then logical reconfiguration can be achieved. Concurrent with this on-line fault isolation, analyzers in the system carry out detailed diagnosis to locate the faulty units, and upon a proper report, the associated controllers can perform the required repair/replacement. Later, recovering units will be allowed to reenter the system. The algorithm is shown to be robust in that it guarantees the survival of any possible active set. In particular, if there are no more than {\it t} faults in the system, the algorithm guarantees that all fault-free processors will eventually identify the fault pattern, provided that the system is {\it t self diagnosable without repair}. Furthermore, if the system is {\it t self diagnosable with repair,} then the algorithm guarantees that at least one controller will be able to repair/replace (at least) one faulty unit. %R TR-12-88 %D 1988 %T On Coupling and the Approximation of the Permanent %A Milena Mihail %X We discuss in detail the limitations of the coupling technique for a Markov Chain on a population with non-trivial combinatorial structure, namely perfect and near-perfect matchings of a dense bipartite graph. %R TR-13-88 %D 1988 %T Distributed Non-Cryptographic Oblivious Transfer with Constant Rounds Secret Function Evaluation %A Donald Beaver %X We develop tools for distributively and secretly computing the {\it parity} of a hidden value in a {\it constant number of rounds} without revealing the secret argument or result, improving the previous $O(\log n)$ round requirement, and we apply them to a {\it generalization of Oblivious Transfer} to the distributed environment.
Oblivious Transfer, in which Alice reveals a bit to Bob with 50-50 probability, but without knowing whether or not Bob received it, has found diverse applications. In the case where Alice and Bob are alone, and they cannot trust each other, all known solutions require extensive cryptography. If, however, there are other trustworthy parties around, we show that oblivious transfer can be performed without any cryptography, even if nobody knows who the trustworthy parties are. It suffices that at most a third or a half of the participants are dishonest. Using previously developed methods for distributed secret computation, we give a solution requiring $O(\log n)$ rounds of interaction. We then give novel techniques to reduce the number of rounds to a constant. These new techniques include computing a random 0-1 secret, taking the multiplicative inverse of a secret, and normalizing a secret (without revealing the result) in a constant expected number of rounds. Such tools allow arbitrary functions over a polynomially sized domain of arguments to be evaluated quickly and efficiently. The methods developed for this paper provide a basis for solutions to other problems in distributed secret computation. %R TR-14-88 %D 1988 %T Learning Boolean Formulae or Finite Automata is as Hard as Factoring %A Michael Kearns %A Leslie G. Valiant %X We prove that the problem of inferring from examples Boolean formulae or deterministic finite automata is computationally as difficult as deciding quadratic residuosity, inverting the RSA encryption function and factoring Blum integers (composite numbers $p \cdot q$ where $p$ and $q$ are primes both congruent to 3 {\it mod} 4). These results are for the distribution-free model of learning.
They hold even when the inference task is that of deriving a probabilistic polynomial time classification algorithm that predicts the correct value of a random input with probability at least $\frac {1} {2} + \frac {1} {p(s)} ,$ where $s$ is the size of the Boolean formula or deterministic finite automaton, and $p$ is any polynomial. %R TR-15-88 %D 1988 %T Merging and Routing on Parallel Models of Computation %A Daniel David Krizanc %X The term ``parallel computation'' encompasses a large and diverse set of problems and theory. In this thesis we use it to refer to any computation which consists of a large collection of tightly coupled synchronized processors working together to solve a terminating computational problem. Models of such computation capture to varying degrees properties of existing and proposed parallel machines. Our main concerns here are with how the models deal with the communication between processors and what effect the introduction of randomness has on the models. The main results of the thesis are: (a) a lower bound on the average time required by a parallel comparison tree (PCT) to merge two lists; (b) a tight tradeoff between the running time, the probability of failure and the number of independent random bits used by an oblivious routing strategy for a fixed connection network; (c) a similar tradeoff for the problem of finding the median on the PCT model; and (d) a tradeoff between the amount of storage required by the nodes of a network and the deterministic time complexity of a class of oblivious routing strategies for the network. %R TR-16-88 %D 1988 %T Parallel Bin Packing Using First-fit and K-delayed Best-fit Heuristics %A Azer Bestavros %A William McKeeman %X In this paper, we present and contrast simulation results for the {\it Bin Packing} problem. We used the Connection Machine to simulate the asymptotic behavior of a variety of packing heuristics. Several parallel algorithms are considered, presented and contrasted in the paper.
The seemingly serial nature of the bin packing simulation has prevented previous experiments from going beyond sizes of several thousand bins. We show that by adopting a fairly simple data parallel algorithm a speedup by a factor of $N$ over a straightforward serial implementation is possible. Sizes of up to hundreds of thousands of bins have been simulated for different parameters and heuristics. %R TR-17-88 %D 1988 %T Ray Tracing for Massively-Parallel Machines %A Robert J. Walsh %X Proper data-type modeling for ray tracing on massively-parallel machines is presented as a basis for parallel graphics algorithm construction. A spatial enumeration algorithm is developed, discussed, modeled, and evaluated. The spatial enumeration evaluation is used to motivate an octree algorithm appropriate for parallel machines. These algorithms are suitable for any graphics problem which can be solved using rays, including radiosity. While the focus is computer graphics, general-purpose, massively-parallel machine architecture is discussed sufficiently to support the analysis and to generate some ideas for machine design. %R TR-18-88 %D 1988 %T A New Scan-Line Algorithm for Massively-Parallel Machines %A Robert J. Walsh %X This paper describes how to rapidly generate a synthetic image using a new rendering algorithm for general-purpose, massively-parallel architectures. The presentation shows how algorithm design decisions were made and how to customize the algorithm to a particular architecture. Alternative algorithms are analyzed qualitatively and quantitatively to demonstrate the superiority of the new algorithm. %R TR-19-88 %D 1988 %T Secure Multiparty Protocols Tolerating Half Faulty Processors %A Donald Beaver %X We show how any function of $n$ inputs can be evaluated by a complete network of $n$ processors, revealing no information other than the result of the function, and tolerating up to $t$ maliciously faulty parties for $2t < n$.
We demonstrate a resilient method to multiply secret values without using cryptography. The crux of our method is a new, non-cryptographic zero-knowledge technique by which a single party can secretly share values $a_ {1} , \cdots , a_m$ along with another secret $B = P(a_ {1} , \cdots, a_m)$, where $P$ is an arbitrary function; and by which the party can prove to all other parties that $B = P(a_ {1} , \cdots, a_ {m} )$, without revealing $B$ or any other information. Using this technique we give a protocol for multiparty private computation that improves the bound established by previous results, which required $3t < n$. Our protocols allow an exponentially small chance of error, but are provably optimal in their resilience against Byzantine faults. %R TR-20-88 %D 1988 %T Managing Event-based Replication for Abstract Data Types in Distributed Systems %A Abdelsalam Abdelhamid Heddaya %X Data replication enhances the availability of data in distributed systems. This thesis deals with the management of a particular representation of replicated data objects that belong to Abstract Data Types (ADT). Traditionally, replicated data objects have been stored in terms of their {\em states} , or {\em values} . In this thesis, I argue for the viability of state transition {\em histories,} or {\em logs} , as a more suitable storage representation for abstract data types in distributed computing environments. We present two main contributions: a new protocol for reducing message and storage requirements of histories, and a novel reconfiguration and recovery method. In the first protocol, we introduce the notion of {\em two phase gossip} as the primary mechanism for managing distributed replicated event histories. We focus our second protocol for reconfiguration and recovery on enhancing the availability of distributed objects in the face of sequences of failures. 
Additionally, our reconfiguration protocol supports system administration functions related to the storage of distributed objects. In combination, the two protocols that we propose demonstrate the viability and desirability of the distributed representation of an ADT object as a history of the state transitions that the data object undergoes, rather than as the value or the sequence of values that it assumes. %R TR-01-89 %D 1989 %T A Parallel Algorithm for Eliminating Cycles in Undirected Graphs %A Philip Klein %A Clifford Stein %X We give an efficient parallel algorithm for finding a maximal set of edge-disjoint cycles in an undirected graph. The algorithm can be generalized to handle a weighted version of the problem. %R TR-02-89 %D 1989 %T On the time-space complexity of reachability queries for preprocessed graphs %A Lisa Hellerstein %A Philip Klein %A Robert Wilber %X How much can preprocessing help in solving graph problems? In this paper, we consider the problem of reachability in a directed bipartite graph, and propose a model for evaluating the usefulness of preprocessing in solving this problem. We give tight bounds for restricted versions of the model that suggest that preprocessing is of limited utility. %R TR-03-89 %D 1989 %T On the Magnification of 0-1 Polytopes %A Milena Mihail %A Umesh Vazirani %R TR-04-89 %D 1989 %T Conductance and Convergence of Markov Chains %A Milena Mihail %A Umesh Vazirani %X Let $P$ be an irreducible and strongly aperiodic (i.e., $p_{ii} \geq \frac{1}{2}\;\forall i$) stochastic matrix. We obtain non-asymptotic bounds for the convergence rate of a Markov chain with transition matrix $P$ in terms of the {\it conductance} of $P$. These results had so far been obtained only for time-reversible Markov chains via partly linear-algebraic arguments. Our proofs eliminate the linear algebra and therefore naturally extend to general Markov chains.
The key new idea is to view the action of a strongly aperiodic stochastic matrix as a weighted averaging along the edges of the {\it underlying graph of} $P$. Our results suggest that the conductance (rather than the second largest eigenvalue) best quantifies the rate of convergence of strongly aperiodic Markov chains. %R TR-05-89 %D 1989 %T Transaction Synchronization in Distributed Shared Virtual Memory Systems %A Meichun Hsu %A Va-On Tam %X Distributed shared virtual memory (DSVM) is an abstraction which integrates the memory space of different machines in a local area network environment into a single logical entity. The algorithm responsible for maintaining this {\it virtually} shared image is called the {\it memory coherence algorithm}. In this paper, we study the interplay between memory coherence and {\it process synchronization}. In particular, we devise two-phase locking-based algorithms in a distributed system under two scenarios: {\it with} and {\it without} an underlying memory coherence system. We compare the performance of the two algorithms using simulation, and argue that significant performance gain can potentially result from bypassing memory coherence and supporting process synchronization directly on distributed memory. We also study the role of {\it optimistic} algorithms in the context of DSVM, and show that an optimistic policy appears promising under the scenarios studied. %R TR-06-89 %D 1989 %T A VLSI Chip for the Real-time Information Dispersal and Retrieval for Security and Fault-Tolerance %A Azer Bestavros %X In this paper, we describe SETH, a hardwired implementation of the recently proposed ``Information Dispersal Algorithm'' (IDA). SETH allows the real-time dispersal of information into different pieces as well as the retrieval of the original information from the available pieces.
SETH accepts a stream of data and a set of ``keys'' so as to produce the required streams of dispersed data to be stored on (or communicated with) the different sinks. The chip can also accept the streams of data from the different sinks along with the necessary controls and keys so as to reconstruct the original information. We begin this paper by introducing the Information Dispersal Algorithm and give an overview of SETH's operation. The different functions are described and system block diagrams of varying levels of detail are presented. Next, we present an implementation of SETH using Scalable CMOS technology that has been fabricated using the MOSIS 3-micron process. We conclude the paper with potential applications and extensions of SETH. In particular, we emphasize the promise of the Information Dispersal Algorithm in the design of I/O subsystems, Redundant Array of Inexpensive Disks (RAID) systems, and reliable communication and routing in distributed/parallel systems. SETH demonstrates that using IDA in these applications is feasible. %R TR-07-89 %D 1989 %T General Purpose Parallel Architectures %A L.G. Valiant %X The possibilities for efficient general purpose parallel computers are examined. First some network models are reviewed. It is then shown how these networks can efficiently implement some basic message routing functions. Finally, various high level models of parallel computation are described and it is shown that the above routing functions are sufficient to implement them efficiently. %R TR-08-89 %D 1989 %T Bulk-Synchronous Parallel Computers %A L.G. Valiant %X We attribute the success of the von Neumann model of sequential computation to the fact that it is an efficient bridge between software and hardware. On the one hand, high level languages can be efficiently compiled on to this model. On the other, it can be efficiently implemented in hardware in current technology.
We argue that an analogous bridge between software and hardware is required for parallel computation if it is to become as widely used. We introduce the bulk-synchronous parallel (BSP) model as a candidate for this role. We justify this suggestion by giving a number of results that quantify its efficiency both in implementing some high level language features and in being implemented in hardware. %R TR-09-89 %D 1989 %T Scheduling Initialization Equations for Parallel Execution %A Lilei Chen %R TR-10-89 %D 1989 %T Hiding Information from Several Oracles %A Donald Beaver %A Joan Feigenbaum %X Abadi, Feigenbaum, and Kilian have considered {\it computations with encrypted data} [AFK]. Let $f$ be a function that is not provably computable in randomized polynomial time; randomized polynomial-time machine A wants to query an oracle B for $f$ to obtain $f(x)$, without telling B exactly what $x$ is. Several well-known random-self-reducible functions, such as discrete logarithm and quadratic residuosity, are {\it encryptable} in this sense; that is, A can query B about an instance while hiding some significant information about the instance. It is shown in [AFK] that, if $f$ is an NP-hard function, A cannot query B while keeping secret all but the size of the instance, assuming that the polynomial hierarchy does not collapse. This negative result holds even if the oracle B has ``infinite'' computational power. Here we show that A {\it can} query $n$ oracles B$_1$, $\ldots$, B$_n$, where $n=|x|$, and obtain $f(x)$ while hiding all but $n$ from each B$_i$, {\it for any boolean function f}. This answers a question due to Rivest that was left open in [AFK]. Our proof adapts techniques developed by Ben-Or, Goldwasser, and Wigderson and by Chaum, Cr\'epeau, and Damg\aa rd for using Shamir's {\it secret-sharing} scheme to hide information about inputs to distributed computations [BGW], [CCD], [S].
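Shamir's secret-sharing scheme, which the proof above builds on, can be sketched in a few lines; this is a generic illustration over a prime field (the prime and function names are illustrative choices), not the paper's multi-oracle protocol:

```python
import random

P = 2**31 - 1  # a Mersenne prime; any prime larger than the secret works

def share(secret, n, t, rng=random):
    """Split `secret` into n Shamir shares: evaluate a random degree-t
    polynomial with constant term `secret` at x = 1..n. Any t+1 shares
    recover the secret; any t shares reveal nothing about it."""
    coeffs = [secret] + [rng.randrange(P) for _ in range(t)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation mod P
            acc = (acc * x + c) % P
        return acc
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % P
                den = (den * (xi - xj)) % P
        # pow(den, P-2, P) is the modular inverse of den (Fermat)
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret
```

For example, `share(s, 5, 2)` produces five shares of which any three reconstruct `s`; in the paper's setting, each oracle B$_i$ would see only one such share of the input.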
%R TR-11-89 %D 1989 %T Perfect Privacy for Two-Party Protocols %A Donald Beaver %X We examine the problem of computing functions in a distributed manner based on private inputs, revealing the value of a function without revealing any additional information about the inputs. A function $f(x_1,\dots,x_n)$ is $t$-private if there is a protocol whereby $n$ parties, each holding an input value, can compute $f,$ and no subset of $t$ or fewer parties gains any information other than what is computable from $f(x_1,\dots,x_n).$ The class of $t$-private {\em boolean} functions for $t \geq\lceil \frac {n} {2} \rceil$ was described in [CK89]. We give a characterization of 1-private functions for two parties, without the restriction to the boolean case. We also examine a restricted form of private computation in the $n$-party case, and show that addition is the only privately computable function in that model. %R TR-12-89 %D 1989 %T The Input Output Real-Time Automaton: A model for real-time parallel computation %A Azer A. Bestavros %X In this paper, we propose a unified framework for embedded real-time systems in which it is possible to see the relation between issues of specification, implementation, correctness, and performance. The framework we suggest is based on the IORTA {\it (Input-Output Real-Time Automata)} model, which is an extension of the previously introduced IOA model. An IORTA is an abstraction that encapsulates a system task. An embedded system is viewed as a set of interacting IORTAs. IORTAs communicate with each other and with the external environment using {\it signals}. A signal carries a sequence of {\it events}, where an event represents an instantiation of an {\it action} at a specific point in time. Actions can be generated by either the environment or the IORTAs. Each IORTA has a {\it state}. The state of an IORTA is observable and can only be changed by local {\it computations}.
Computations are triggered by actions and have to be scheduled to meet specific timing constraints. IORTAs can be {\it composed} together to form higher level IORTAs. A specification of an IORTA is a description of its behavior (i.e., how it reacts to stimuli from the environment). An IORTA is said to {\it implement} another IORTA if it is impossible to differentiate between their external behaviors. This is the primary tool that is used to verify that an implementation meets the required specification. %R TR-13-89 %D 1989 %T The Computational Complexity of Machine Learning %A Michael J. Kearns %X This thesis is a study of the computational complexity of machine learning from examples in the distribution-free model introduced by L.G. Valiant. In the distribution-free model, a learning algorithm receives positive and negative examples of an unknown target set ({\it or concept}) that is chosen from some known class of sets ({\it concept class}). These examples are generated randomly according to a fixed but unknown probability distribution representing Nature, and the goal of the learning algorithm is to infer a hypothesis concept that closely approximates the target concept with respect to the unknown distribution. This thesis is concerned with proving theorems about learning in this formal mathematical model. We are interested in the phenomenon of {\it efficient} learning in the distribution-free model, in the standard polynomial-time sense. Our results include general tools for determining the polynomial-time learnability of a concept class, an extensive study of efficient learning when errors are present in the examples, and lower bounds on the number of examples required for learning in our model. A centerpiece of the thesis is a series of results demonstrating the computational difficulty of learning a number of well-studied concept classes.
These results are obtained by reducing some apparently hard number-theoretic problems from cryptography to the learning problems. The hard-to-learn concept classes include the sets represented by Boolean formulae, deterministic finite automata and a simplified form of neural networks. We also give algorithms for learning powerful concept classes under the uniform distribution, and give equivalences between natural models of efficient learnability. This thesis also includes detailed definitions and motivation for the distribution-free model, a chapter discussing past research in this model and related models, and a short list of important open problems. %R TR-14-89 %D 1989 %T Learning with Nested Generalized Exemplars %A Steven Lloyd Salzberg %X This thesis presents a theory of learning called nested generalized exemplar theory (NGE), in which learning is accomplished by storing objects in Euclidean $n$-space, $E^n$, as hyper-rectangles. The hyper-rectangles may be nested inside one another to arbitrary depth. In contrast to most generalization processes, which replace symbolic formulae by more general formulae, the generalization process for NGE learning modifies hyper-rectangles by growing and reshaping them in a well-defined fashion. The axes of these hyper-rectangles are defined by the variables measured for each example. Each variable can have any range on the real line; thus the theory is not restricted to symbolic or binary values. The basis of this theory is a psychological model called exemplar-based learning, in which examples are stored strictly as points in $E^n$. This thesis describes some advantages and disadvantages of NGE theory, positions it as a form of exemplar-based learning, and compares it to other inductive learning theories.
An implementation has been tested on several different domains, four of which are presented in this thesis: predicting the recurrence of breast cancer, classifying iris flowers, predicting survival times for heart attack patients, and a discrete event simulation of a prediction task. The results in these domains are at least as good as, and in some cases significantly better than, those of other learning algorithms applied to the same data. Exemplar-based learning is emerging as a new direction for machine learning research. The main contribution of this thesis is to show how an exemplar-based theory, using nested generalizations to deal with exceptions, can be used to create very compact representations with excellent modelling capability. %R TR-15-89 %D 1989 %T Finite-State Analysis of Asynchronous Circuits with Bounded Temporal Uncertainty %A Harry R. Lewis %R TR-16-89 %D 1989 %T Flux Tracing: A Flexible Infrastructure for Global Shading %A Jon Christensen %A Joe Marks %A Robert Walsh %A Mark Friedell %X Flux tracing is a flexible, efficient, and easily implemented mechanism for determining scene intervisibility. Flux tracing can be combined with a variety of reflection models in several ways, yielding many different global-shading techniques. Several shading techniques based on flux tracing are illustrated. All provide intensity gradients and shadows with penumbras resulting from area light sources at finite distances. Some provide specular reflection and color bleeding, and some are extremely efficient. Flux tracing is both an expedient means of constructing an efficient global shader and a flexible tool for experimental development of global shading algorithms. %R TR-17-89 %D 1989 %T Efficient Use of Image and Intervisibility Coherence in Rendering and Radiosity Calculations %A Mark Friedell %A Joe Marks %A Robert Walsh %A Jon Christensen %X Rendering algorithms based on image-area sampling were first proposed almost 20 years ago.
In attempting to solve the hidden-surface problem from many pixels simultaneously, area-sampling algorithms are the most aggressive attempts to exploit image coherence. Although area-sampling algorithms are intuitively appealing, they usually do not perform well. When adequate image coherence is present, however, they can perform extremely well for some image regions. We present in this paper two new, hybrid rendering algorithms that combine area sampling with point-sampling techniques. In the presence of significant image coherence, these algorithms are considerably faster than any generally applicable alternative algorithms; for no image are they perceptibly slower. We also show how similar hybrid techniques can exploit intervisibility coherence to efficiently determine form factors for radiosity calculations. %R TR-18-89 %D 1989 %T The Role of User Models in System Design %A Lisa Rubin Neal %X The inadequacy of system designers' models of users has led to difficulties with the use, and even the rejection, of systems. We formulate a multi-dimensional user model, of which two dimensions are related to learning and decision-making styles and additional dimensions are related to the plurality of expertise involved in the use of a system. The incorporation of a user model characterizing relevant cognitive variables allows the tailoring of a system to the needs and abilities of its users. We use a number of computer games as rich tasks which allow us to derive information about users. We present the results of our research examining computer games and discuss the implications of this research for system design. %R TR-19-89 %D 1989 %T Fast Fault-Tolerant Parallel Communication with Low Congestion and On-Line Maintenance Using Information Dispersal %A Yuh-Dauh Lyuu %X The space-efficient Information Dispersal Algorithm (IDA) is applied to communication in parallel computers to achieve fast communication, low congestion, and fault tolerance on various networks.
All schemes run {\em within\/} their stated time bounds without long delay. Let $N$ denote the size of the network. In the case of the hypercube, our communication scheme runs in $\,2\cdot \log N+1\,$ time using constant size buffers. Its probability of successful routing is at least $\,1-N^ {-2.419\cdot\log N + 1.5} $, proving Rabin's conjecture. The same scheme tolerates $\,N/(12\cdot e\cdot\log N)\,$ random link failures with probability at least $\,1-2\cdot N\cdot (\log N)^ {-\log N/12} \,$ ($e=2.718\ldots$). It can also tolerate $\,N/c\,$ random link failures with probability $\,1- O(N^ {-1} )\,$ for some constant $c$. For a class of $d$-way shuffle networks, our scheme runs in $\,\approx 2\cdot \ln N/\ln\ln N\,$ time using constant size buffers. Its probability of successful routing is at least $\,1-N^ {-\ln N/2} $. The same scheme tolerates $\,N/12\,$ random link failures with probability at least $\,1- N^ {-\ln\ln\ln N/12} .\,$ For a class of $d$-way digit-exchange networks, our scheme runs in $\,\approx 6\cdot \ln N/\ln\ln N\,$ time using constant size buffers. Its probability of successful routing is at least $\,1- o(N^ {-7\cdot\ln N} )$. The same scheme tolerates $\, N/(6\cdot e)\,$ random link failures with probability at least $\,1-2\cdot N^ {-\ln\ln\ln N/2} $. Another fault model, where links fail independently with a constant probability, is also considered. Numerical calculations show that with practical failure probabilities and sizes of the hypercube, our routing scheme for the hypercube performs well with high probability. On-line and efficient wire testing and replacement on the hypercube can be realized if our fault-tolerant routing scheme is used. Let $\,\alpha\,$ denote the total number of links. It is shown that $\,\approx\alpha/352\,$ wires can be disabled simultaneously without disrupting the ongoing computation or degrading the routing performance much.
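The dispersal primitive underlying this routing scheme can be sketched as follows; this is a generic illustration of Rabin's IDA over the prime field GF(257) with illustrative function names (a production implementation would typically work over GF(2^8)), not the thesis's routing scheme itself:

```python
P = 257  # prime just above byte values, so each symbol fits one data byte

def disperse(data, n, m):
    """Encode `data` (a list of ints < P) into n pieces, any m of which
    suffice to reconstruct. Piece i maps each length-m chunk through
    row (1, i, i^2, ..., i^(m-1)) of an n x m Vandermonde matrix mod P."""
    while len(data) % m:                      # pad to a multiple of m
        data = data + [0]
    pieces = []
    for i in range(1, n + 1):
        row = [pow(i, k, P) for k in range(m)]
        piece = []
        for off in range(0, len(data), m):
            chunk = data[off:off + m]
            piece.append(sum(r * c for r, c in zip(row, chunk)) % P)
        pieces.append((i, piece))
    return pieces

def reconstruct(pieces, m):
    """Recover the padded data from any m pieces by solving the m x m
    Vandermonde system chunk by chunk (Gauss-Jordan elimination mod P)."""
    pieces = pieces[:m]
    out = []
    for pos in range(len(pieces[0][1])):
        # Augmented system V x = y for this chunk position
        A = [[pow(i, k, P) for k in range(m)] + [piece[pos]]
             for i, piece in pieces]
        for col in range(m):
            piv = next(r for r in range(col, m) if A[r][col])
            A[col], A[piv] = A[piv], A[col]
            inv = pow(A[col][col], P - 2, P)  # modular inverse of pivot
            A[col] = [a * inv % P for a in A[col]]
            for r in range(m):
                if r != col and A[r][col]:
                    f = A[r][col]
                    A[r] = [(a - f * b) % P for a, b in zip(A[r], A[col])]
        out.extend(A[k][m] for k in range(m))
    return out
```

Each piece is a factor of m smaller than the original data, which is what makes the scheme space-efficient: n pieces cost n/m times the original size, yet any n - m of them can be lost.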
%R TR-20-89 %D 1989 %T Lower Bounds on Parallel, Distributed and Automata Computations %A Mih\'{a}ly Ger\'{e}b-Graus %X In this thesis we present a collection of lower bound results: A hierarchy of complexity classes on tree languages (analogous to the polynomial hierarchy) accepted by alternating finite state machines is introduced. By separating the deterministic and the nondeterministic classes of our hierarchy we give a negative answer to the folklore question of whether the expressive power of tree automata is the same as that of finite state automata that can walk on the edges of the tree (bug automata). We prove that a three-head one-way DFA cannot perform string-matching, that is, no three-head one-way DFA accepts the language $L=\{x\#y \mid x$ {\rm is a substring of} $y$, {\rm where} $x,y \in \{0,1\}^*\}$. We prove that in a one-round fair coin flipping (or voting) scheme with $n$ participants, there is at least one participant who has a chance to decide the outcome with probability at least $3/n-o(1/n)$. We prove an optimal lower bound on the average time required by any algorithm that merges two sorted lists on the parallel comparison tree model. We present a proof of a negative answer to a question raised by Skyum and Valiant, namely, whether the class of symmetric boolean functions has a p-complete family. We give a combinatorial characterization of the concept classes learnable from negative (or positive) examples only in the so-called distribution-free learning model. %R TR-21-89 %D 1989 %T A Data Mapping Parallel Language %A Vinod Kathail %A Dan C. Stefanescu %R TR-22-89 %D 1989 %T An Analysis of the Valiant-Brebner Hypercube\\ Routing Algorithm %A Athanasios Tsantilas %X We present the best known analysis of the running time of the Valiant-Brebner algorithm for routing $h$-relations in the $n$-dimensional hypercube. 
Refining an elegant proof technique due to Ranade, we prove that for $h \sim n/\omega(n)$, where $\omega(n)$ tends to infinity arbitrarily slowly, the running time of this algorithm is $2n+O(n/\ln\omega(n))$ with very high probability. The same analysis holds for the directed $n$-butterfly as well. %R TR-23-89 %D 1989 %T Tight Bounds for Oblivious Routing in the Hypercube %A C. Kaklamanis %A D. Krizanc %A A. Tsantilas %X We prove that given an $N$-node communication network with maximum indegree $d$, any deterministic oblivious algorithm for routing an arbitrary permutation requires $\Omega(\sqrt{N}/d)$ time. This is an improvement of a result by Borodin and Hopcroft. For the $N$-node hypercube, in particular, we show a matching upper bound by exhibiting a deterministic oblivious algorithm which routes any permutation in $O(\sqrt{N}/\log N)$ time. The best previously known upper bound was $O(\sqrt{N})$. %R TR-24-89 %D 1989 %T SIMD Algorithms for Image Rendering %A Robert J. Walsh %X A parameterized cost model for SIMD machines is developed and used to rank projective image rendering algorithms. The ranking considers such an extremely broad range of remote communication costs that it holds for all practical situations. Ranked first is a new projective algorithm presented in the thesis. Empirical results support the ranking developed from the analytic models. The cost model is also applied to ray tracing, so that precomputed approximate sorts of surface primitives can be rigorously used to speed parallel ray tracing. The general-purpose architecture assumption is justified by showing that special-purpose machines actually render images more slowly. We conjecture that neither SIMD nor MIMD architectures have a fundamental advantage when efficient algorithms are known for both machine types. The algorithms and analysis presented can serve as a model for SIMD computations in other application domains. 
%R TR-01-90 %D 1990 %T Perfect Privacy for Two-Party Protocols %A Donald Beaver %X We examine the problem of computing functions in a distributed manner based on private inputs, revealing the value of a function without revealing any additional information about the inputs. A function $f(x_1,\dots,x_n)$ is $t$-private if there is a protocol whereby $n$ parties, each holding an input value, can compute $f,$ and no subset of $t$ or fewer parties gains any information other than what is computable from $f(x_1,\dots,x_n).$ The class of $t$-private {\em boolean} functions for $t \geq\lceil \frac {n} {2} \rceil$ was described in [ck89]. We give a characterization of 1-private functions for two parties, without the restriction to the boolean case. We also examine a restricted form of private computation in the $n$-party case, and show that addition is the only $t$-privately computable function in that model. Incorrect proofs of this characterization appeared in [Kushilevitz,1989] and an earlier technical report [Beaver,1989]. We present a different proof which avoids the errors of those works. This report supersedes Harvard Technical Report TR-11-89. %R TR-02-90 %D 1990 %T Cooperative Dialogues While Playing\\ Adventure %A David Albert %X To collect examples of the use of hedging phrases in informal conversation, three pairs of subjects were (separately) asked to cooperate in playing the computer game {\em Adventure} . While one typed commands into the computer, the two engaged in conversation about their options and best strategy. The conversation was recorded on audio tape, and their computer session was saved in a disk file. Subsequently, the audio tape was transcribed and merged with the computer record of the game, to produce a combined transcript showing what was typed along with a simultaneous running commentary by the participants. This report contains the complete transcripts and a discussion of the collection and transcription methodology. 
%R TR-03-90 %D 1990 %T M-LISP:\\ A Representation Independent Dialect of LISP with Reduction Semantics %A Robert Muller %X In this paper we propose to reconstruct LISP from first principles with an eye toward reconciling S-expression LISP's metalinguistic facilities with the kind of operational semantics advocated by Plotkin [pl81]. After reviewing the original definition of LISP we define the abstract syntax and the operational semantics of the new dialect, M-LISP, and show that its equational theory is consistent. Next we develop the operational semantics of an extension of M-LISP which features an explicitly callable {\em eval} and {\em fexprs} (i.e., procedures whose arguments are passed {\em by-representation} ). Since M-LISP is independent of any representation of its programs, it has no {\em quotation} operator or any of its related forms. To compensate for this we encapsulate the shifting between mention and use which is performed globally by {\em quote} within the metalinguistic constructs that require it. The resulting equational system is shown to be inconsistent. We leave it as an open problem to find confluent variants of the metalinguistic constructs. %R TR-04-90 %D 1990 %T Syntax Macros in M-LISP:\\ A Representation Independent Dialect of LISP with Reduction Semantics %A Robert Muller %X In this paper we consider syntactic abstraction in M-LISP, a dialect of LISP which is independent of any representation of its programs. Since it is independent of McCarthy's original Meta-expression representation scheme, M-LISP has no {\em quotation} form or any of its related forms {\em backquote} , {\em unquote} or {\em unquote-splicing} . Given that LISP macro systems depend on this latter representation model, it is not obvious how to support syntax extensions in M-LISP. Our approach is based on an adaptation of the {\em Macro-by-Example} [kowa87] and {\em Hygienic} algorithms [kofrfedu86]. 
The adaptation for the quotation-free syntactic structure of M-LISP yields a substantially different model of syntax macros. The most important difference is that $\lambda$ binding patterns become apparent when an abstraction is first (i.e., partially) transcribed in the syntax tree. This allows us to define tighter restrictions on the capture of identifiers. This is not possible in S-expression dialects such as Scheme since $\lambda$ binding patterns are not apparent until the tree is completely transcribed. %R TR-05-90 %D 1990 %T Semantics Prototyping in M-LISP\\ (Extended Abstract) %A Robert Muller %X In this paper we describe a new semantic metalanguage which simplifies prototyping of programming languages. The system integrates Paulson's semantic grammars within a new dialect of LISP, M-LISP, which has somewhat closer connections to the $\lambda$-calculus than other LISP dialects such as Scheme. The semantic grammars are expressed as attribute grammars. The generated parsers are M-LISP functions that can return denotational (i.e., higher-order) representations of abstract syntax. We illustrate the system with several examples and compare it to related systems. %R TR-06-90 %D 1990 %T ESPRIT\\ Executable Specification of Parallel Real-time Interactive Tasks %A Azer Bestavros %X The vital role that embedded systems are playing and will continue to play in our world, coupled with their increasingly complex and critical nature, demands a rigorous and systematic treatment that recognizes their unique requirements. The Time-constrained Reactive Automaton (TRA) is a formal model of computation that admits these requirements. Using the TRA model, an embedded system is viewed as a set of {\em asynchronously} interacting automata (TRAs), each representing an {\em autonomous} system entity. TRAs are {\em input enabled} ; they communicate by signaling events on their {\em output channels} and by reacting to events signaled on their {\em input channels} . 
The behavior of a TRA is governed by {\em time-constrained causal relationships} between {\em computation-triggering} events. The TRA model is {\em compositional} and allows time, control, and computation {\em non-determinism} . In this paper we present ESPRIT, a specification language that is entirely based on the TRA model. We have developed a compiler that allows ESPRIT specifications to be executed in simulated time, thus providing a valuable validation tool for embedded system specifications. We are currently developing another compiler that would allow the execution of ESPRIT specifications in real-time, thus making it possible to write real-time programs directly in ESPRIT. %R TR-07-90 %D 1990 %T A Logic of Concrete Time Intervals\\ (Extended Abstract) %A Harry R. Lewis %X This paper describes (1) a finite-state model for asynchronous systems in which the time delays between the scheduling and occurrence of the events that cause state changes are constrained to fall between fixed numerical upper and lower time bounds; (2) a branching-time temporal logic suitable for describing the temporal and logical properties of asynchronous systems, for which the structures of (1) are the natural models; and (3) a functional verification system for asynchronous circuits, which generates, from a boolean circuit with general feedback and specified min/max rise and fall times for the gates, a finite-state structure as in (1), and then exhaustively checks a formal specification of that circuit in the language of (2) against that finite-state model. %R TR-08-90 %D 1990 %T Generating Descriptions that Exploit a User's Domain\\ Knowledge %A Ehud Reiter %X Natural language generation systems should customize object descriptions according to the extent of their user's domain and lexical knowledge. 
The task of generating customized descriptions is formalized as a task of finding descriptions that are {\it accurate} (truthful), {\it valid} (fulfill the speaker's communicative goal), and {\it free of false implicatures} (do not give rise to unintended conversational implicatures) with respect to the current user model. An algorithm that generates descriptions that meet these constraints is described, and the computational complexity of the generation problem is discussed. %R TR-09-90 %D 1990 %T The Computational Complexity of Avoiding Conversational\\ Implicatures %A Ehud Reiter %X Referring expressions and other object descriptions should be maximal under the Local Brevity, No Unnecessary Components, and Lexical Preference preference rules; otherwise, they may lead hearers to infer unwanted conversational implicatures. These preference rules can be incorporated into a polynomial time generation algorithm, while some alternative formalizations of conversational implicature make the generation task NP-Hard. %R TR-10-90 %D 1990 %T Generating Appropriate Natural Language Object Descriptions %A Ehud Baruch Reiter %X Natural language generation (NLG) systems must produce different utterances for users with different amounts of domain and lexical knowledge. An utterance that is meant to be read by an expert should use technical vocabulary, and avoid explicitly mentioning facts the expert can immediately infer from the rest of the utterance. In contrast, an utterance that is meant to be read by a novice should avoid specialized vocabulary, and may be required to explicitly mention facts that would be obvious to an expert. An NLG system that does not customize utterances according to its user's domain and lexical knowledge may generate text that is incomprehensible to a novice, or text that leads an expert to infer unwanted {\it conversational implicatures} (Grice 1975). 
This thesis examines the problem of generating attributive descriptions of individuals, that is, object descriptions that are intended to inform the user that a particular object has certain attributes. It proposes that such descriptions will be appropriate for a particular user if they are {\it accurate, valid} , and {\it free of false implicatures} with respect to a user-model that represents that user's relevant domain and lexical knowledge. Descriptions are represented as definitions of KL-ONE-type (Brachman and Schmolze 1985) classes, and a description is called {\it accurate} if it defines a class that subsumes the object being described; {\it valid} if every attribute the system wishes to communicate is either part of the description or a default attribute that is inherited by the class defined by the description; and {\it free of false implicatures} if it is maximal under three preference rules: No Unnecessary Components, Local Brevity, and Lexical Preference. %R TR-11-90 %D 1990 %T Domain Theory for Nonmonotonic Functions %A Yuli Zhou %A Robert Muller %X We prove several lattice theoretical fixpoint theorems based on the classical result of Knaster-Tarski. These theorems give sufficient conditions for a system of generally nonmonotonic functions on a complete lattice to define a unique fixpoint. The primary objective of this paper is to develop a domain theoretic framework to study the semantics of general logic programs as well as various rule-based systems where the rules define nonmonotonic functions on lattices. %R TR-12-90 %D 1990 %T A Syntax and Semantics for Network Diagrams %A Joe Marks %X The ability to automatically design graphical displays of data will be important for the next generation of interactive computer systems. The research reported here concerns the automated design of network diagrams, one of the three main classes of symbolic graphical display (the other two being chart graphs and maps). 
Previous notions of syntax and semantics for network diagrams are not adequate for automating the design of this kind of graphical display. I present here a new formulation of syntax and semantics for network diagrams that is used in the ANDD (Automated Network Diagram Designer) system. The syntactic formulation differs from previous work in two significant ways: perceptual-organization phenomena are explicitly represented, and syntax is described in terms of constraints rather than as a grammar of term-rewriting rules. The semantic formulation is based on an application-independent model of network systems that can be used to model many real-world applications. The paper includes examples that show how these concepts are used by ANDD to automatically design network diagrams. %R TR-13-90 %D 1990 %T Avoiding Unwanted Conversational Implicatures in Text and Graphics %A Joe Marks %A Ehud Reiter %X We have developed two systems, FN and ANDD, that use natural language and graphical displays, respectively, to communicate information about objects to human users. Both systems must deal with the fundamental problem of ensuring that their output does not carry unwanted and inappropriate conversational implicatures. We describe the types of conversational implicatures that FN and ANDD can avoid, and the computational strategies the two systems use to generate output that is free of unwanted implicatures. %R TR-14-90 %D 1990 %T Models of Plans to Support Communications:\\ An Initial Report %A Karen E. Lochbaum %A Barbara J. Grosz %A Candace L. Sidner %X Agents collaborating to achieve a goal bring to their joint activity different beliefs about ways in which to achieve the goal and the actions necessary for doing so. Thus, a model of collaboration must provide a way of representing and distinguishing among agents' beliefs and of stating the ways in which the intentions of different agents contribute to achieving their goal. 
Furthermore, in collaborative activity, collaboration occurs in the planning process itself. Thus, rather than modelling plan recognition, per se, what must be modelled is the {\it augmentation} of beliefs about the actions of multiple agents and their intentions. In this paper, we modify and expand the SharedPlan model of collaborative behavior (Grosz \& Sidner 1990). We present an algorithm for updating an agent's beliefs about a partial SharedPlan and describe an initial implementation of this algorithm in the domain of network management. %R TR-15-90 %D 1990 %T An Information Dispersal Approach to Issues in Parallel Processing %A Yuh-Dauh Lyuu %X Efficient schemes for the following issues in parallel processing are presented: fast communication, low congestion, fault tolerance, simulation of ideal parallel computation models, synchronization in asynchronous networks, low sensitivity to variations in component speed, and on-line maintenance. All our schemes employ Rabin's information dispersal idea. We also develop an efficient information dispersal algorithm (IDA) based on the Fast Fourier Transform and an IDA-based voting scheme to enforce fault tolerance. Let $N$ denote the size of the hypercube network. We present a randomized communication scheme, FSRA (for ``Fault-tolerant Subcube Routing Algorithm''), that routes in $2 \cdot \log N + 1$ time using only constant size buffers and with probability of success $1 - N^ {-\Theta (\log N)} $. (All log's are to the base 2.) FSRA also tolerates $O(N)$ random link failures with high probability. Similar results are also obtained for the de Bruijn and the butterfly networks (without fault tolerance in the latter case). FSRA is employed to simulate, without using hashing, a class of CRCW PRAM (concurrent-read concurrent-write parallel random access machine) programs with a slowdown of $O(\log N)$ with almost certainty if combining is used. 
A fault-tolerant simulation scheme for general CRCW PRAM programs is also presented. A simple acknowledgement synchronizer can make all our routing schemes in this dissertation run on asynchronous networks without loss of efficiency. We further show that the speed of any component--be it a processor or a link--has only linear impact on the run-time of FSRA; that is, the extra delay in run-time is only proportional to the drift in the component's delay and is independent of the size of the network. On-line maintainability makes the machine more available to the user. We show that, under FSRA, a constant fraction of the links can be disabled with essentially no impact on the routing performance. This result immediately suggests several efficient maintenance procedures. Based on the above results, a fault-tolerant parallel computing system, called HPC (for ``hypercube parallel computer''), is sketched at the end of this dissertation. %R TR-16-90 %D 1990 %T Identifying $\mu$-Formula Decision Trees with Queries %A Thomas R. Hancock %X We consider a learning problem for the representation class of $\mu$-formula decision trees, a generalization of $\mu$-formulas and $\mu$-decision trees. (The ``$\mu$'' form of a representation has the restriction that no variable appears more than once.) The learning model is one of exact identification by oracle queries, where the learner's goal is to discover an unknown function by asking membership queries (is the function true on some specified input?) and equivalence queries (is the function identical to some hypothesis we present, and if not, what is an input on which they differ?). We present an identification algorithm using these two types of queries that runs in time polynomial in the number of variables, and show that no such polynomial time algorithm exists that uses either membership or equivalence queries alone (in the latter case under the stipulation that the hypotheses are drawn from the same representation class). 
We further extend the algorithm to identify a broader class in which the formulas are taken over a more powerful basis that includes arbitrary threshold gates. %R TR-17-90 %D 1990 %T Design and Modeling with Schema Grammars %A Mark Friedell %A Sandeep Kochhar %X Graphical scene modeling is usually a time-consuming, tedious, and expensive manual activity. This paper proposes an approach to partially automating the process through a paradigm for cooperative user-computer scene modeling that we call {\em cooperative computer-aided design} (CCAD). Formal grammars, referred to as {\em schema grammars} , are used to imbue the modeling system with an elementary ``understanding'' of the kinds of scenes to be created. The grammar interpreter constructs part or all of the scene after accepting from the user partially completed scene components and descriptions of scene properties that must be incorporated into the final scene. This approach to modeling harnesses the power of the computer to construct scene detail, thereby freeing the human user to focus on essential creative decisions. This paper describes the structure and interpretation of schema grammars, and provides techniques for controlling the combinatorial explosion that can result from the undirected interpretation of the grammars. CCAD is explored in the context of two experimental systems---FLATS, for architectural design, and LG, for modeling landscapes. %R TR-18-90 %D 1990 %T Cooperative Computer-Aided Design: A Paradigm for Automating the Design and Modeling of Graphical Objects %A Sandeep Kochhar %X Design activity is often characterized by a search in which the designer examines various alternatives at several stages during the design process. Current computer-aided design (CAD) systems, however, provide very little support for this exploratory aspect of design. 
My research provides the foundation for {\em cooperative computer-aided design (CCAD)} ---a novel CAD technology that intersperses partial exploration of design alternatives by the computer with guiding design operations by the system user. CCAD combines the strengths of manual and automated design and modeling by allowing the user to make creative decisions and provide specialized detail, while exploiting the power of the machine to explore many design alternatives and create detail that is resonant with the user's design decisions. In the CCAD paradigm, the user expresses initial design decisions in the form of a partial design and a set of properties that the final design must satisfy. The user then initiates the generation by the system of alternative {\em partial} developments of the initial design subject to a ``language'' of valid designs. The results are then structured in a spatial framework through which the user moves to explore the alternatives. The user selects the most promising partial design, refines it manually, and then requests further automatic development. This process continues until a satisfactory complete design is created. I present the interpretation of schema grammars---a class of generative grammars for manipulating graphical objects---as the fundamental generative mechanism underlying CCAD. I describe in detail several mechanisms for providing user control over the generative process, thereby controlling the combinatorial explosion inherent in an unrestricted, undirected interpretation of generative grammars. I explore graphical browsing as a facility for efficiently perusing the set of design alternatives generated by the system. I also describe FLATS ({\bf F}loor plan {\bf LA}you{\bf T} {\bf S}ystem)---a prototype CCAD system for the design of small architectural floor plans. %R TR-19-90 %D 1990 %T Understanding Subsumption and Taxonomy:\\ A Framework for Progress %A William A. 
Woods %X This paper continues a theme begun in my paper, ``What's in a Link,'' -- seeking a solid foundation for network representations of knowledge. Like its predecessor, its goal is to clarify issues and establish a framework for progress. The paper analyzes the concepts of subsumption and taxonomy and synthesizes a framework that integrates and clarifies many previous approaches and goes beyond them to provide an account of abstract and partially defined concepts. The distinction between definition and assertion is reinterpreted in a framework that accommodates probabilistic and default rules as well as universal claims and abstract and partial definitions. Conceptual taxonomies in this framework are shown to be useful for indexing and organizing information and for managing the resolution of conflicting defaults. The paper introduces a distinction between intensional and extensional subsumption and argues for the importance of the former. It presents a classification algorithm based on intensional subsumption and shows that its typical case complexity is logarithmic in the size of the knowledge base. %R TR-20-90 %D 1990 %T The KL-ONE Family %A William A. Woods %A James G. Schmolze %X The knowledge representation system KL-ONE has been one of the most influential and imitated knowledge representation systems in the Artificial Intelligence community. Begun at Bolt Beranek and Newman in 1978, KL-ONE pioneered the development of taxonomic representations that can automatically classify and assimilate new concepts based on a criterion of terminological subsumption. This theme generated considerable interest in both the formal community and a large community of potential users. The KL-ONE community has since expanded to include many systems at many institutions and in many different countries. This paper introduces the KL-ONE family and discusses some of the main themes explored by KL-ONE and its successors. 
We give an overview of current research, describe some of the systems that have been developed, and outline some future research directions. %R TR-21-90 %D 1990 %T Efficiency of Semi-Synchronous versus Asynchronous Networks %A Hagit Attiya %A Marios Mavronicolas %X The $s$-session problem is studied in {\em asynchronous} and {\em semi-synchronous} networks. Processes are located at the nodes of an undirected graph $G$ and communicate by sending messages along links that correspond to the edges of $G$. A session is a part of an execution in which each process takes at least one step; an algorithm for the $s$-session problem guarantees the existence of at least $s$ disjoint sessions. The existence of many sessions guarantees a degree of interleaving which is necessary for certain computations. It is assumed that the (real) time for message delivery is at most $d$. In the asynchronous model, it is assumed that the time between any two consecutive steps of any process is in the interval $[0,1]$; in the semi-synchronous model, the time between any two consecutive steps of any process is in the interval $[c,1]$ for some $c$ such that $0 < c \leq 1$, the {\em synchronous} model being the special case where $c=1$. In the {\em initialized} case of the problem, all processes are initially synchronized and take a step at time 0. For the asynchronous model, an upper bound of $diam(G)(d+1)(s-1)$ and a lower bound of $diam(G)d(s-1)$ are presented; $diam(G)$ is the {\em diameter} of $G$. For the semi-synchronous model, an upper bound of $1+\min\{\lfloor \frac {1} {c} \rfloor +1, diam(G)(d+1)\} (s-2)$ is presented. The main result of the paper is a lower bound of $1+\min\{\lfloor \frac {1} {2c} \rfloor, diam(G)d\} (s-2)$ for the time complexity of any semi-synchronous algorithm for the $s$-session problem, under the assumption that $d \geq \frac {d} {\min\{\lfloor\frac {1} {2c} \rfloor, diam(G)d\}} + 2$. 
These results imply a time separation between semi-synchronous (in particular, synchronous) and asynchronous networks. Similar results are proved for the case where delays are not uniform. In the {\em uninitialized} case of the problem, all processes but one, the {\em initiator} , start in a {\em quiescent} state which they may leave upon receiving a message. Similar results are proved for this case. %R TR-22-90 %D 1990 %T Efficient execution of homogeneous tasks with unequal run times on the Connection Machine %A Azer Bestavros %A Thomas Cheatham %X Many scientific applications require the execution of a large number of identical {\em tasks} , each on a different set of data. Such applications can easily benefit from the power of SIMD architectures ( {\em e.g. the Connection Machine} ) by having the array of processing elements (PEs) execute the task in parallel on the different data sets. It is often the case, however, that the task to be performed involves the repetitive application of the same sequence of steps, {\em a body} , for a number of times that depends on the input or computed data. If the usual {\em task-level synchronization} is used, the utilization of the array of PEs degrades substantially. In this paper, we propose a {\em body-level synchronization} scheme that boosts the utilization of the array of PEs while keeping the required overhead to a minimum. We mathematically analyze the proposed technique and show how to optimize its performance for a given application. Our technique is especially efficient when the number of tasks to be executed is much larger than the number of physical PEs available. %R TR-23-90 %D 1990 %T Modelling Act-Type Relations in Collaborative Activity %A Cecile T. Balkanski %X Intelligent agents collaborating to achieve a common goal direct a significant portion of their effort to planning together. 
Because having a plan to perform a given task involves having knowledge about the ways in which the performance of a set of actions will lead to the performance of that task, the representation of actions and of relations among them is a central concern. This paper provides a formalism for representing act-type relations and complex act-type constructors in multiagent domains. To determine a set of relations that would span the space of complex collaborative actions, I analyzed a videotape of two people building a piece of furniture, and identified a group of relations that are adequate to represent the relationships among the actions occurring in the data. The definitions provided here are more complex than those used in earlier studies; in particular, I refine and expand Pollack's (1986) set of act-type relations (generation and enablement), provide constructors for building complex act-types from (simpler) act-types (simultaneity, conjunction, sequence and iteration), and make it possible to represent the joint actions of multiple agents. %R TR-24-90 %D 1990 %T Security, Fault Tolerance, and Communication Complexity in Distributed Systems %A Donald Rozinak Beaver %X We present efficient and practical algorithms for a large, distributed system of processors to achieve reliable computations in a secure manner. Specifically, we address the problem of computing a general function of several private inputs distributed among the processors of a network, while ensuring the correctness of the results and the privacy of the inputs, despite accidental or malicious faults in the system. Communication is often the most significant bottleneck in distributed computing. Our algorithms maintain a low cost in local processing time, are the first to achieve optimal levels of fault-tolerance, and most importantly, have low communication complexity. 
In contrast to the best known previous methods, which require large numbers of rounds even for fairly simple computations, we devise protocols that use small messages and a constant number of rounds {\em regardless} of the complexity of the function to be computed. Through direct algebraic approaches, we separate the {\em communication complexity of secure computing} from the {\em computational complexity} of the function to be computed. We examine security under both the modern approach of computational complexity-based cryptography and the classical approach of unconditional, information-theoretic security. We develop a clear and concise set of definitions that support formal proofs of claims to security, addressing an important deficiency in the literature. Our protocols are provably secure. In the realm of information-theoretic security, we characterize those functions which two parties can compute jointly with absolute privacy. We also characterize those functions which a weak processor can compute using the aid of powerful processors without having to reveal the instances of the problem it would like to solve. Our methods include a promising new technique called a {\em locally random reduction} , which has given rise not only to efficient solutions for many of the problems considered in this work but to several powerful new results in complexity theory. %R TR-25-90 %D 1990 %T Communication Issues in Parallel Computation %A Athanasios M. Tsantilas %X This thesis examines the problem of interprocessor communication in realistic parallel computers. In particular, we consider the problem of permutation routing and its generalizations in the mesh, hypercube and butterfly networks. Building on previous research, we derive lower bounds for a wide class of deterministic routing algorithms which imply that such algorithms create heavy traffic congestion. 
In contrast, we show that randomized routing algorithms result in efficient and optimal upper bounds in the above networks. Experiments were also performed to test the behaviour of the randomized algorithms. These experiments suggest interesting theoretical problems. We also examine the problem of efficient interprocessor communication in a model suggested by recent advances in optical computing. The main argument of this thesis is that communication can be made efficient if randomization is used in the routing algorithms. %R TR-01-91 %D 1991 %T QCD on the Connection Machine: Beyond $^*\hbox{LISP}$ %A Ralph G. Brickner %A Clive F. Baillie %A S. Lennart Johnsson %X We report on the status of code development for a simulation of Quantum Chromodynamics (QCD) with dynamical Wilson fermions on the Connection Machine model CM-2. Our original code, written in *Lisp, gave performance in the near-GFLOPS range. We have rewritten the most time-consuming parts of the code in the low-level programming system CMIS, including the matrix multiply and the communication. Current versions of the code run at approximately 3.6 GFLOPS for the fermion matrix inversion, and we expect the next version to reach or exceed 5 GFLOPS. %X tr-01-91.ps.gz %R TR-02-91 %D 1991 %T Communication and I/O Libraries %A S. Lennart Johnsson %A Patrick Worley %X tr-02-91.ps.gz %R TR-03-91 %D 1991 %T Optimal Communication Channel Utilization for Matrix Transposition and Related Permutations on Boolean Cubes %A S. Lennart Johnsson %A Ching-Tien Ho %X TR-16-92 SUPERSEDES TR-03-91 %R TR-04-91 %D 1991 %T Generalized Shuffle Permutations on Boolean Cubes %A S. Lennart Johnsson %A Ching-Tien Ho %X In a {\em generalized shuffle permutation} an address $(a_{q-1} a_{q-2} \ldots a_{0})$ receives its content from an address obtained through a cyclic shift on a subset of the $q$ dimensions used for the encoding of the addresses. Bit-complementation may be combined with the shift.
We give an algorithm that requires $\frac{K}{2} + 2$ exchanges for $K$ elements per processor, when storage dimensions are part of the permutation, and concurrent communication on all ports of every processor is possible. The number of element exchanges in sequence is independent of the number of processor dimensions $\sigma_{r}$ in the permutation. With no storage dimensions in the permutation our best algorithm requires $(\sigma_{r} + 1) \lceil \frac{K}{2 \sigma_{r}} \rceil$ element exchanges. We also give an algorithm for $\sigma_{r} = 2$, or when the real shuffle consists of a number of cycles of length two, that requires $\frac{K}{2} + 1$ element exchanges in sequence when there is no bit complement. The lower bound is $\frac{K}{2}$ for both real and mixed shuffles with no bit complementation. The minimum number of communication start-ups is $\sigma_{r}$ for both cases, which is also the lower bound. The data transfer time for communication restricted to one port per processor is $\sigma_{r} \frac{K}{2}$, and the minimum number of start-ups is $\sigma_{r}$. The analysis is verified by experimental results on the Intel iPSC/1, and for one case also on the Connection Machine. %X tr-04-91.ps.gz %R TR-05-91 %D 1991 %T The Computational Complexity of Cartographic Label\\ Placement %A Joe Marks %A Stuart Shieber %X We examine the computational complexity of cartographic label placement, a problem derived from the cartographer's task of placing text labels adjacent to map features in such a way as to minimize overlaps with other labels and map features. Cartographic label placement is one of the most time-consuming tasks in the production of maps.
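The address mapping behind the generalized shuffle permutations of TR-04-91 can be sketched as follows. This is a minimal illustration of a cyclic shift on a subset of the address dimensions with optional bit-complementation; the function name and the bitmask convention for the complement are assumptions, not the report's notation:

```python
def shuffle_dest(addr, dims, complement=0):
    """Destination address of `addr` under a generalized shuffle.

    The bit at position dims[i] of addr moves to position
    dims[(i + 1) % len(dims)]; bits outside `dims` stay put.
    `complement` is a bitmask of positions whose bit is inverted
    as well (bit-complementation combined with the shift).
    """
    k = len(dims)
    out = addr
    for d in dims:
        out &= ~(1 << d)                 # clear the shifted positions
    for i, d in enumerate(dims):
        bit = (addr >> d) & 1
        out |= bit << dims[(i + 1) % k]  # cyclic shift within `dims`
    return out ^ complement
```

With `dims` equal to all $q$ dimensions this reduces to the ordinary perfect shuffle, and applying it $q$ times returns every address to itself.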
Consequently, several attempts have been made to automate the label-placement task for some or all classes of cartographic features (punctual, linear, or areal features), but all previously published algorithms for the most basic task---point-feature-label placement---either exhibit worst-case exponential time complexity, or incorporate incomplete heuristics that may fail to find an admissible labeling even when one exists. The computational complexity of label placement is therefore a matter of practical significance in automated cartography. We show that admissible label placement is NP-complete, even for very simple versions of the problem. Thus, no polynomial-time algorithm exists unless $P=NP$. Similarly, we show that optimal label placement can be solved in polynomial time if and only if $P=NP$, and this result holds even if we require only approximately optimal placements. The results are especially interesting because cartographic label placement is one of the few combinatorial problems that remains NP-hard even under a geometric (Euclidean) interpretation. The results are of broader practical significance, as they also apply to point-feature labeling in non-cartographic displays, e.g., the labeling of points in a scatter plot. %X tr-05-91.ps.gz %R TR-06-91 %D 1991 %T A Graphical Editor for Three-Dimensional Constraint-Based\\ Geometric Modeling %A Steven John Sistare %X The design of geometric models can be a painstaking and time-consuming task. Typical CAD packages are often primitive or deficient in the means they offer for placing geometry in a design or for subsequently modifying the geometry. Modification in particular can tax a user's patience when it requires many deletion, creation, or perturbation operations to effect a conceptually simple change in the design. One area of research that attempts to address these deficiencies involves the use of constraints on the geometry as a means of both specifying and controlling its shape.
The form in which the constraint information is demanded from the user determines the ease of use of any geometric system that is based on constraints, and is one of the basic problems to be addressed in the design of such a system. I present a constraint-based geometric editor that allows the manipulation of both constraints and geometry using the direct-manipulation paradigm, which is well established as being of central importance in many easy-to-use systems. When using the editor, constraints are presented to the user graphically, in the context of the geometric design, and may be created, destroyed, and manipulated interactively along with the geometry. The constraints may either be created explicitly, or implicitly as a side effect of creating geometry. In addition, constraints are used in a novel way to facilitate interactive creation and positioning of geometry in three-space, despite the limitations of commonly-available two-dimensional display and input devices. Lastly, whenever geometry is modified using direct manipulation, a solver is called which updates the geometry in accordance with the existing constraints. All of these features contribute to the ease of use of my system. I also present a solver that addresses another basic problem inherent in constraint-based systems; namely, the need to efficiently obtain a solution that instantiates the geometry so as to satisfy the constraints. I provide a robust and efficient solver that is $O(n^2)$ in the size of the geometric problem being solved and present the mathematics that support it. In addition, I present a new algorithm that partitions the free-form geometry and constraint network into a number of pieces that may be independently solved. For many networks, this algorithm yields a solution to the entire network in close to linear time.
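The partitioning step described at the end of TR-06-91's abstract, splitting the constraint network into independently solvable pieces, amounts to finding connected components of the constraint graph. A minimal sketch under assumed representations (entities as hashable identifiers, each constraint as a pair of coupled entities; the report's actual algorithm and data structures may differ):

```python
def partition_network(entities, constraints):
    """Split a constraint network into independently solvable pieces.

    entities: hashable geometry identifiers.
    constraints: iterable of (a, b) pairs meaning a constraint couples
    entities a and b.  Entities in different connected components share
    no constraint, so each component can be handed to a solver on its
    own.  Union-find keeps this near-linear in the network size.
    """
    parent = {e: e for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in constraints:
        parent[find(a)] = find(b)           # union the two components

    groups = {}
    for e in entities:
        groups.setdefault(find(e), []).append(e)
    return sorted(sorted(g) for g in groups.values())
```

Each returned group can then be passed separately to the $O(n^2)$ solver, so the cost is quadratic only in the largest component rather than the whole network.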
%R TR-07-91 %D 1991 %T Tight Upper and Lower Bounds on the Path Length\\ of Binary Trees %A Alfredo De Santis %A Giuseppe Persiano %X The {\em external path length} of a tree $T$ is the sum of the lengths of the paths from the root to each external node. The {\em maximal path length difference,} $\triangle$, is the difference between the length of the longest and shortest such path. We prove tight lower and upper bounds on the external path length of binary trees with $N$ external nodes and prescribed maximal path length difference $\triangle$. In particular, we give an upper bound that, for each value of $\triangle$, can be exactly achieved for infinitely many values of $N$. This improves on the previously known upper bound that could only be achieved up to a factor proportional to $N$. We then use the upper bound to give a simple upper bound on the path length of Red-Black trees which is asymptotically tight. We also present, as a preliminary result, an elementary proof of the known upper bound. Finally, we prove a lower bound which can be exactly achieved for each value of $N$ and $\triangle \leq N/2$. %R TR-08-91 %D 1991 %T Abstract Semantics of First-Order Recursive Schemes %A Robert Muller %A Yuli Zhou %X We develop a general framework for deriving abstract domains from concrete semantic domains in the context of first-order recursive schemes and prove several theorems which ensure the correctness (safety) of abstract computations. The abstract domains, which we call {\em Weak Hoare powerdomains}, subsume the roles of both the abstract domains and the collecting interpretations in the abstract interpretation literature. %R TR-09-91 %D 1991 %T Semantic Domains for Abstract Interpretation %A Robert Muller %A Yuli Zhou %X In this paper we consider abstract interpretation of PCF programs. The main development is the extension of {\em weak powerdomains} to higher types.
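The two quantities bounded in TR-07-91 above are straightforward to compute directly from a tree. A minimal sketch, using an assumed representation in which a tree is either `None` (an external node) or a pair of subtrees:

```python
def path_lengths(tree, depth=0):
    """Depths of all external nodes of a binary tree.

    A tree is either None (an external node) or a pair
    (left, right) of subtrees.
    """
    if tree is None:
        return [depth]
    left, right = tree
    return path_lengths(left, depth + 1) + path_lengths(right, depth + 1)

def external_path_length(tree):
    """Sum of root-to-external-node path lengths."""
    return sum(path_lengths(tree))

def max_path_difference(tree):
    """The quantity written as triangle in the abstract: longest
    minus shortest root-to-external-node path."""
    d = path_lengths(tree)
    return max(d) - min(d)
```

For the complete tree on $N = 4$ external nodes the external path length is $N \log_2 N = 8$ with difference $0$; skewing the tree raises the path length and the difference together, which is the trade-off the report's bounds make precise.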
In the classical abstract interpretation approach, abstract domains are constructed explicitly and the abstract semantics is then related to the concrete semantics. In the approach introduced here, abstract domains are {\em derived} directly from concrete domains. The conditions for deriving the domains are intended to be as general as possible while still guaranteeing that the derived domain has sufficient structure so that it can be used as a basis for computing correct information about the concrete semantics. We prove three main theorems, the last of which ensures the correctness of abstract interpretation of PCF programs given safe interpretations of the constants. This generalizes earlier results obtained for the special case of strictness analysis. %R TR-10-91 %D 1991 %T Performance Modeling of Distributed Memory Architectures %A S. Lennart Johnsson %X We provide performance models for several primitive operations on data structures distributed over memory units interconnected by a Boolean cube network. In particular, we model single source, and multiple source concurrent broadcasting or reduction, concurrent gather and scatter operations, shifts along several axes of multi- dimensional arrays, and emulation of butterfly networks. We also show how the processor configuration, data aggregation, and the encoding of the address space affect the performance for two important basic computations: the multiplication of arbitrarily shaped matrices, and the Fast Fourier Transform. We also give an example of the performance behavior for local matrix operations for a processor with a single path to local memory, and a set of registers. The analytic models are verified by measurements on the Connection Machine model CM-2. %X tr-10-91.ps.gz %R TR-11-91 %D 1991 %T Outerjoins---How to Extend a Conventional Optimizer %A C\'{e}sar Galindo-Legaria %A Arnon Rosenthal %X Free choice among join orderings is one of the most powerful optimizations in a conventional optimizer. 
But the freedom is limited to Select/Project/Join queries. In this paper, we extend this freedom to queries that include outerjoins. Unlike previous work, these results are not limited to queries possessing a ``nice structure,'' or queries that are nicely represented in relational calculus. Our theoretical results concern query ``simplification'' and reassociation using a generalized outerjoin. We show how the necessary computation can be added rather easily to the join-order generation of a conventional query optimizer. %R TR-12-91 %D 1991 %T An Algorithm for Plan Recognition in Collaborative Discourse %A Karen E. Lochbaum %X A model of plan recognition in discourse must be based on intended recognition, distinguish each agent's beliefs and intentions from the other's, and avoid assumptions about the correctness or completeness of the agents' beliefs. In this paper, we present an algorithm for plan recognition that is based on the SharedPlan model of collaboration and that satisfies these constraints. %R TR-13-91 %D 1991 %T Object-Oriented Programming for Massively Parallel Machines %A Michael F. Kilian %X Large, robust massively parallel programs that are understandable (and therefore maintainable) are not yet a reality. Such programs require a programming methodology that minimizes the conceptual differences between the program and the domain addressed by the program, encourages reusability, and still produces robust programs that are readily maintained and reasoned about. This paper proposes the parallel object-oriented model. The model is constructed from an object-oriented methodology augmented by constructs and semantics for parallel processing, and satisfies the requirements for building large parallel applications. It presents a unique way of representing object references and of managing concurrent access to objects. The methodology may be extended for a wide range of computing platforms and application areas. 
%R TR-14-91 %D 1991 %T Plan Recognition in Collaborative Discourse %A Karen E. Lochbaum %X A model of plan recognition in discourse must be based on intended recognition, distinguish each agent's beliefs and intentions from the other's, and avoid assumptions about the correctness or completeness of the agents' beliefs. In this paper, we present an algorithm for plan recognition that is based on the SharedPlan model of collaboration and that satisfies these constraints. %R TR-15-91 %D 1991 %T Elliptic Curves in Computer Science:\\ Primality Testing, Factoring, and Cryptography %A Michael Mitzenmacher %R TR-16-91 %D 1991 %T Identifiability is Closed under Embeddings in Read-Once\\ Formulas or $\mu$-Decision Trees %A Thomas R. Hancock %X We show a general positive result that allows us to boost the expressiveness of projection closed classes of boolean functions that are identifiable with membership and equivalence queries. Such classes include monotone DNF formulas, read-once formulas, conjunctions of Horn clauses, switch configurations, $\mu$-formula decision trees, and read-twice DNF formulas. We show that when representations from such classes (rather than single literals) are tested at the leaves of a read-once formula, or on the internal nodes of a $\mu$-decision tree, the resulting representation class is still identifiable with membership and equivalence queries. The additional overhead in time and queries is polynomial. %R TR-17-91 %D 1991 %T Computational Complexity of a Problem in Molecular Structure Prediction %A J. Thomas Ngo %A Joe Marks %X The computational task of protein-structure prediction is believed to require exponential time, but previous arguments as to its intractability have taken into account only the size of a protein's conformational space. Such arguments do not rule out the possible existence of an algorithm, more selective than exhaustive search, that is efficient and exact. 
(An {\em efficient} algorithm is one that is guaranteed, for all possible inputs, to run in time bounded by a function polynomial in the problem size. An {\em intractable} problem is one for which no efficient algorithm exists.) Questions regarding the possible intractability of problems are often best answered using the theory of NP-completeness. In this treatment we show the NP-hardness of two typical mathematical statements of empirical potential energy function minimization for macromolecules. Unless all NP-complete problems can be solved efficiently, these results imply that a function-minimization algorithm can be efficient for protein-structure prediction only if it exploits protein-specific properties that prohibit the simple geometric constructions that we use in our proofs. Analysis of further mathematical statements of molecular structure prediction could constitute a systematic methodology for identifying sources of complexity in protein folding, and for guiding development of predictive algorithms. %X tr-17-91.ps.gz %R TR-18-91 %D 1991 %T Optimal All-to-All Personalized Communication with Minimum Span on Boolean Cubes %A S. Lennart Johnsson %A Ching-Tien Ho %X All-to-all personalized communication is a class of permutations in which each processor sends a unique message to every other processor. We present optimal algorithms for concurrent communication on all channels in Boolean cube networks, both for the case with a single permutation, and the case where multiple permutations shall be performed on the same local data set, but on different sets of processors. For $K$ elements per processor our algorithms give the optimal number of element transfers, $K/2$. For a succession of all-to-all personalized communications on disjoint subcubes of $\beta$ dimensions each, our best algorithm yields $\frac{K}{2} + \sigma - \beta$ element exchanges in sequence, where $\sigma$ is the total number of processor dimensions in the permutation. An implementation on the Connection Machine of one of the algorithms offers a maximum speed-up of 50\% compared to the previously best known algorithm. %X tr-18-91.ps.gz %R TR-19-91 %D 1991 %T Matrix Multiplication on Hypercubes Using Full Bandwidth and Constant Storage %A Ching-Tien Ho %A S. Lennart Johnsson %A Alan Edelman %X For matrix multiplication on hypercube multiprocessors with the product matrix accumulated in place, a processor must receive about $P^{2}/\sqrt{N}$ elements of each input operand, with operands of size $P \times P$ distributed evenly over $N$ processors. With concurrent communication on all ports, the number of element transfers in sequence can be reduced to $P^{2}/(\sqrt{N} \log N)$ for each input operand. We present a two-level partitioning of the matrices and an algorithm for the matrix multiplication with optimal data motion and constant storage. The algorithm has sequential arithmetic complexity $2P^{3}$, and parallel arithmetic complexity $2P^{3}/N$. The algorithm has been implemented on the Connection Machine model CM-2. For the performance on the 8K CM-2, we measured about 1.6 Gflops, which would scale up to about 13 Gflops for a 64K full machine. %X tr-19-91.ps.gz %R TR-20-91 %D 1991 %T On the Conversion between Binary Code and Binary-Reflected Gray Code on Boolean Cubes %A S. Lennart Johnsson %A Ching-Tien Ho %X We present a new algorithm for conversion between binary code and binary-reflected Gray code that requires approximately $\frac{2K}{3}$ element transfers in sequence for $K$ elements per node, compared to $K$ element transfers for previously known algorithms.
For a Boolean cube of $n = 2$ dimensions the new algorithm degenerates to yield a complexity of $\frac{K}{2} + 1$ element transfers, which is optimal. The new algorithm is optimal within a factor of $\frac{4}{3}$ for any routing strategy. We show that the minimum number of element transfers for minimum path length routing is $K$ with concurrent communication on all channels of every node of a Boolean cube. %X tr-20-91.ps.gz %R TR-21-91 %D 1991 %T All-to-All Broadcast and Applications on the Connection\\ Machine %A Jean-Philippe Brunet %A S. Lennart Johnsson %X An all-to-all broadcast routing algorithm that allows concurrent communication on all channels of the Connection Machine Boolean cube network is described. Explicit routing formulas are given for both the physical broadcast between processors, and the virtual broadcast within processors. Implementation issues are addressed and timings for the physical and virtual broadcast are given for the Connection Machine system CM-2. The peak data transfer rate for the physical broadcast on a 64k CM-2 is 4.1 Gbytes/sec, and the peak rate for the virtual broadcast is about 20 Gbytes/sec. Reshaping of arrays is shown experimentally to reduce the broadcast time by a factor of up to 7 by reducing the amount of local data motion. Finally, we also show how to exploit symmetry for computation of an interaction matrix using the all-to-all broadcast function. Further optimizations are suggested for $N$-body type calculations. Using the all-to-all broadcast function, a peak rate of 5 Gflops/s has been achieved for the $N$-body computations in 32-bit precision on a 64k CM-2. %X tr-21-91.ps.gz %R TR-22-91 %D 1991 %T New Approaches to Automating Network-Diagram Layout %A Corey Kosak %A Joe Marks %A Stuart Shieber %X Network diagrams are a familiar graphic form that can express many different kinds of information. The problem of automating network-diagram layout has therefore received much attention.
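TR-20-91 above concerns moving data between the two encodings across a Boolean cube; the encodings themselves are the standard binary-reflected Gray code, which for reference can be computed per address as follows (the forward map is a single shift-and-XOR; the inverse is a prefix XOR over the bits):

```python
def binary_to_gray(b):
    """Binary-reflected Gray code of the non-negative integer b."""
    return b ^ (b >> 1)

def gray_to_binary(g):
    """Inverse map: XOR of all right shifts of g (prefix XOR of bits)."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b
```

Consecutive Gray codes differ in exactly one bit, which is why Gray-coded arrays place neighboring elements in adjacent cube nodes and why the conversion between the two allocations is a nontrivial permutation.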
Previous research on network-diagram layout has focused on the problem of aesthetically optimal layout, using such criteria as the number of link crossings, the sum of all link lengths, and total diagram area. In this paper we propose a restatement of the network-diagram-layout problem in which layout-aesthetic concerns are subordinated to perceptual-organization concerns. We describe a notation for describing the visual organization of a network diagram. This notation is used in reformulating the layout task as a constrained-optimization problem in which constraints are derived from a visual-organization specification and optimality criteria are derived from layout-aesthetic considerations. Two new heuristic algorithms are presented for this version of the layout problem: one algorithm uses a rule-based strategy for computing a layout; the other is a massively parallel genetic algorithm. We demonstrate the capabilities of the two algorithms by testing them on a variety of network-diagram-layout problems. %R TR-23-91 %D 1991 %T Minimizing the Communication Time for Matrix Multiplication on Multi-Processors %A S. Lennart Johnsson %X We present a few algorithms that allow concurrency in communication on multiple channels of multi-processors to be exploited for the multiplication of matrices of arbitrary shapes. For multi-processors configured as $n$-dimensional Boolean cubes our algorithms offer a speedup of the communication over previous algorithms for square matrices and square cubes by a factor of $\frac{n}{2}$. We show that configuring $N$ processors as a three-dimensional array may reduce the communication complexity by a factor of $\sqrt[6]{N}$ compared to the two-dimensional partitioning. The best two-dimensional configuration of the multi-processor nodes has a ratio between the number of rows and columns equal to the ratio between the number of rows and columns of the product matrix.
The optimum three-dimensional configuration has a ratio between the lengths of the machine axes equal to the ratio between the lengths of the three axes in matrix multiplication. For product matrices of extreme shape a one-dimensional partitioning may be optimum. All presented algorithms use standard communication functions. %X tr-23-91.ps.gz %R TR-24-91 %D 1991 %T Cooley-Tukey FFT on the Connection Machine %A S. Lennart Johnsson %A Robert L. Krawitz %X We describe an implementation of the Cooley-Tukey complex-to-complex FFT on the Connection Machine. The implementation is designed to make effective use of the communications bandwidth of the architecture, its memory bandwidth, and storage with precomputed twiddle factors. The peak data motion rate that is achieved for the interprocessor communication stages is in excess of 7 Gbytes/s for a Connection Machine system CM-200 with 2048 floating-point processors. The peak rate of FFT computations local to a processor is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision. The same FFT routine is used to perform both one- and multi-dimensional FFT without any explicit data rearrangement. The peak performance for one-dimensional FFT on data distributed over all processors is 5.4 Gflops/s in 32-bit precision and 3.2 Gflops/s in 64-bit precision. The peak performance for square, two-dimensional transforms, is 3.1 Gflops/s in 32-bit precision, and for cubic, three-dimensional transforms, the peak is 2.0 Gflops/s in 64-bit precision. Certain oblong shapes yield better performance. The number of twiddle factors stored in each processor is $\frac{P}{2N} + \log_{2} N$ for an FFT on $P$ complex points uniformly distributed among $N$ processors. To achieve this level of storage efficiency we show that a decimation-in-time FFT is required for normal order input, and a decimation-in-frequency FFT is required for bit-reversed input order.
%X tr-24-91.ps.gz %R TR-25-91 %D 1991 %T Communication Efficient Multi-processor FFT %A S. Lennart Johnsson %A Michel Jacquemin %A Robert L. Krawitz %X Computing the Fast Fourier Transform on a distributed memory architecture by a direct pipelined radix-2 algorithm, a bi-section or multi-section algorithm, all yield the same communications requirement, if communication for all FFT stages can be performed concurrently, the input data is in normal order, and the data allocation is consecutive. With a cyclic data allocation, or bit-reversed input data and a consecutive allocation, multi-sectioning offers a reduced communications requirement by approximately a factor of two. For a consecutive data allocation, normal input order, a decimation-in-time FFT requires that $\frac{P}{N} + d - 2$ twiddle factors be stored for $P$ elements distributed evenly over $N$ processors and the axis subject to transformation distributed over $2^{d}$ processors. No communication of twiddle factors is required. The same storage requirements hold for a decimation-in-frequency FFT, bit-reversed input order, and consecutive data allocation. The opposite combination of FFT type and data ordering requires a factor of $\log_{2} N$ more storage for $N$ processors. The peak performance for a Connection Machine system CM-200 implementation is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision for unordered transforms local to each processor. The corresponding execution rates for ordered transforms are 11.1 Gflops/s and 8.5 Gflops/s, respectively. For distributed one- and two-dimensional transforms the peak performance for unordered transforms exceeds 5 Gflops/s in 32-bit precision, and 3 Gflops/s in 64-bit precision. Three-dimensional transforms execute at a slightly lower rate. Distributed ordered transforms execute at a rate of about $\frac{1}{2}$ to $\frac{2}{3}$ of that of the unordered transforms.
%X tr-25-91.ps.gz %R TR-26-91 %D 1991 %T Learning Nonoverlapping Perceptron Networks From Examples and Membership Queries %A Thomas R. Hancock %A Mostefa Golea %A Mario Marchand %X We investigate, within the PAC learning model, the problem of learning nonoverlapping perceptron networks. These are loop-free neural nets in which each node has only one outgoing weight. We give a polynomial-time algorithm that PAC learns any nonoverlapping perceptron network using examples and membership queries. The algorithm is able to identify both the architecture and the weight values necessary to represent the function to be learned. Our results shed some light on the effect of the overlap on the complexity of learning in neural networks. %R TR-27-91 %D 1991 %T Labeling Point Features on Maps and Diagrams Using Simulated Annealing %A Jon Christensen %A Joe Marks %A Stuart Shieber %X A major factor affecting the clarity of graphical displays is the degree to which text labels obscure display features (including other labels) as a result of spatial overlap. Point-feature label placement (PFLP) is the problem of placing text labels adjacent to point features on a map or diagram in order to maximize legibility. This problem arises for all kinds of informational graphics, though it is most often associated with automated cartography. In this paper we present a comprehensive treatment of the PFLP problem. First, we summarize some recent results regarding the computational complexity of PFLP. These results show that optimal PFLP is NP-hard. Second, we survey previously reported algorithms for PFLP. Third, we describe a stochastic-optimization method for PFLP, based on simulated annealing. Finally, we present the results of an empirical comparison of the known algorithms for PFLP. Our results indicate that the simulated-annealing approach to PFLP is superior to all existing methods, regardless of label density.
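The simulated-annealing approach of TR-27-91 can be sketched in miniature. This is only an illustrative toy, not the report's method: labels are assumed to be axis-aligned boxes with four candidate positions (the four corners touching the point), the objective counts overlapping label pairs, and all schedule parameters are invented for the example:

```python
import math
import random

def label_cost(points, cfg, w=2.0, h=1.0):
    """Number of overlapping label pairs for a candidate placement.
    cfg[i] in {0, 1, 2, 3} selects which corner of label i touches point i."""
    offsets = [(0, 0), (-w, 0), (0, -h), (-w, -h)]
    boxes = []
    for (x, y), p in zip(points, cfg):
        dx, dy = offsets[p]
        boxes.append((x + dx, y + dy, x + dx + w, y + dy + h))
    def overlaps(a, b):
        return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]
    return sum(overlaps(boxes[i], boxes[j])
               for i in range(len(boxes)) for j in range(i + 1, len(boxes)))

def anneal_labels(points, steps=3000, seed=0):
    """Metropolis moves reposition one random label; geometric cooling."""
    rng = random.Random(seed)
    cfg = [rng.randrange(4) for _ in points]
    c, temp = label_cost(points, cfg), 1.0
    for _ in range(steps):
        i, p = rng.randrange(len(cfg)), rng.randrange(4)
        new = cfg[:i] + [p] + cfg[i + 1:]
        nc = label_cost(points, new)
        # accept improvements always, uphill moves with Boltzmann probability
        if nc <= c or rng.random() < math.exp((c - nc) / temp):
            cfg, c = new, nc
        temp *= 0.999
    return cfg, c
```

The occasional acceptance of uphill moves is what lets annealing escape the local minima that greedy repositioning gets trapped in, which is the property the report's empirical comparison credits for its strong results.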
%R TR-28-91 %D 1991 %T Linearizable Read/Write Objects %A Marios Mavronicolas %A Dan Roth %X We study the cost of implementing {\em linearizable} read/write objects for shared-memory multiprocessors and under various assumptions on the available timing information. We take as cost measure the {\em worst-case response time} of performing an operation in distributed implementations of virtual shared memory consisting of such objects and supporting linearizability. It is assumed that processes have clocks that run at the same rate as real time and all messages incur a delay in the range $[d-u,d]$ for some known constants $u$ and $d$, $0 \leq u \leq d$. In the {\em perfect clocks} model, where processes have perfectly synchronized clocks and every message incurs a delay of exactly $d$, we present a family of optimal linearizable implementations, parameterized by a constant $\beta$, $0 \leq \beta \leq 1$, for which the worst-case response times for read and write operations are $\beta d$ and $(1-\beta)d$, respectively. The parameter $\beta$ may be appropriately chosen to account for the relative frequencies of read and write operations. Our main result is the first known linearizable implementation for the {\em imperfect clocks} model where clocks are not initially synchronized and message delays can vary, i.e., $u > 0$; it achieves worst-case response times of less than $4u+b$ ($b>0$ is an arbitrarily small constant) and $d+3u$ for read and write operations, respectively. This implementation uses novel synchronization techniques to exploit the lower bound on message delay time and achieve bounds on worst-case response times that depend on the message delay uncertainty $u$. For a wide range of values of $u$, these bounds improve previously known ones for implementations that support consistency conditions even weaker than linearizability. %R TR-01-92 %D 1992 %T Multiplication of Matrices of Arbitrary Shape on a Data\\ Parallel Computer %A Kapil K. Mathur %A S. 
Lennart Johnsson %X Some level-2 and level-3 Distributed Basic Linear Algebra Subroutines (DBLAS) that have been implemented on the Connection Machine system CM-200 are described. For matrix-matrix multiplication, both the nonsystolic and the systolic algorithms are outlined. A systolic algorithm that computes the product matrix in-place is described in detail. All algorithms that are presented here are part of the Connection Machine Scientific Software Library, CMSSL. We show that a level-3 DBLAS yields better performance than a level-2 DBLAS. On the Connection Machine system CM-200, blocking yields a performance improvement by a factor of up to three over level-2 DBLAS. For certain matrix shapes the systolic algorithms offer both improved performance and significantly reduced temporary storage requirements compared to the nonsystolic block algorithms. The performance improvement over the blocked nonsystolic algorithms may be as much as a factor of seven, or more than a factor of 20 over the level-2 DBLAS. We show that, in order to minimize the communication time, an algorithm that leaves the largest operand matrix stationary should be chosen for matrix-matrix multiplication. Furthermore, it is shown both analytically and experimentally that the optimum shape of the processor array yields square stationary submatrices in each processor, i.e., the ratio between the length of the axes of the processing array must be the same as the ratio between the corresponding axes of the stationary matrix. The optimum processor array shape may yield a factor of five performance enhancement for the multiplication of square matrices. For rectangular matrices a factor of 30 improvement was observed for an optimum processor array shape compared to a poorly chosen processor array shape. %X tr-01-92.ps.gz %R TR-02-92 %D 1992 %T A Data Parallel Finite Element Method for Computational Fluid Dynamics on the Connection Machine System %A Zden\u{e}k Johan %A Thomas J.R.
Hughes %A Kapil K. Mathur %A S. Lennart Johnsson %X A finite element method for computational fluid dynamics has been implemented on the Connection Machine systems CM-2 and CM-200. An implicit iterative solution strategy, based on the preconditioned matrix-free GMRES algorithm, is employed. Parallel data structures built on both nodal and elemental sets are used to achieve maximum parallelization. Communication primitives provided through the Connection Machine Scientific Software Library substantially improved the overall performance of the program. Computations of three-dimensional compressible flows using unstructured meshes having close to one million elements, such as a complete airplane, demonstrate that the Connection Machine systems are suitable for these applications. Performance comparisons are also carried out with the vector computers Cray Y-MP and Convex C-1. %X tr-02-92.ps.gz %R TR-03-92 %D 1992 %T Efficiency of Semi-Synchronous versus Asynchronous Systems: Atomic Shared Memory %A Marios Mavronicolas %X The {\em $s$-session problem} is studied in {\em asynchronous} and {\em semi-synchronous} shared-memory systems, under a particular shared-memory communication primitive -- $k$-writer, $k$-reader atomic registers -- where $k$ is a constant reflecting the communication bound in the model. A session is a part of an execution in which each of $n$ processes takes at least one step; an algorithm for the $s$-session problem guarantees the existence of at least $s$ disjoint sessions. The existence of many sessions guarantees a degree of interleaving which is necessary for certain computations. In the asynchronous model, it is assumed that the time between any two consecutive steps of any process is in the interval $[0,1]$; in the semi-synchronous model, the time between any two consecutive steps of any process is in the interval $[c,1]$ for some $c$ such that $0 < c \leq 1$, the {\em synchronous} model being the special case where $c=1$.
All processes are initially synchronized and take a step at time $0$. Our main result is a tight (within constant factors) lower bound of $1 + \min\{\lfloor \frac{1}{2c} \rfloor, \lfloor \log_{k}(n-1) - 1 \rfloor\}(s-2)$ for the time complexity of any semi-synchronous algorithm for the $s$-session problem. This result implies a time separation between semi-synchronous and asynchronous shared memory systems. %R TR-04-92 %D 1992 %T Block-Cyclic Dense Linear Algebra %A Woody Lichtenstein %A S. Lennart Johnsson %X Block-cyclic order elimination algorithms for LU and QR factorization and solve routines are described for distributed memory architectures with processing nodes configured as two-dimensional arrays of arbitrary shape. The cyclic order elimination together with a consecutive data allocation yields good load-balance for both the factorization and solution phases for the solution of dense systems of equations by LU and QR decomposition. Blocking may offer a substantial performance enhancement on architectures for which the level-2 or level-3 BLAS are ideal for operations local to a processing node. High rank updates local to a node may have a performance that is a factor of four or more higher than a rank-1 update. We show that in our implementation the $O(N^{2})$ work in the factorization is of the same significance as the $O(N^{3})$ work, even for large matrices, because the $O(N^{2})$ work is poorly load-balanced in the two-dimensional processor array configuration. However, we show that the two-dimensional processor array configuration with consecutive data allocation and block-cyclic order elimination is optimal with respect to communication for a simple, but fairly general communications model. In our Connection Machine system CM-200 implementation, the peak performance for LU factorization is about 9.4 Gflops/s in 64-bit precision and 16 Gflops/s in 32-bit precision.
Blocking offers an overall performance enhancement of about a factor of two. For the data motion, use is made of the fact that the nodes along each axis of the two-dimensional array are interconnected as Boolean cubes. %X tr-04-92.ps.gz %R TR-05-92 %D 1992 %T Efficient, Strongly Consistent Implementations\\ of Shared Memory %A Marios Mavronicolas %A Dan Roth %X We present two distributed organizations of multiprocessor shared memory and develop for them implementations that are shown to satisfy a strong consistency condition, namely {\em linearizability}, achieve improvements in efficiency over previous ones that support even weaker consistency conditions, and possess other important, sought-after properties that make them practically attractive. It is assumed throughout this paper that processes have clocks that run at the same rate as real time and all messages incur a delay in the range $[d-u,d]$ for some known constants $u$ and $d$, $0 \leq u \leq d$. The efficiency of an implementation is measured by the {\em worst-case response time} for performing an operation on an object. For the {\em full caching} organization, where each process keeps local copies of all objects, we present the first efficient linearizable implementation of read/write objects. The family of linearizable implementations we present is parameterized in a way that allows one to degrade the less frequently employed operation, and is shown to be essentially optimal. For the {\em single ownership} organization, each shared object is ``owned'' by a single process, which is most likely to access it frequently. We present an implementation that allows a process to access local information much faster (almost instantaneously) than it can access remote information, while still supporting linearizability. While the cost of the global operations depends on the maximal message delay $d$, the cost of the local operations depends only on the message delay uncertainty $u$.
In both implementations, decisions made by individual processes do not make use of any communicated timing information. In particular, timing information is not part of the messages passed by our protocols, and those are of bounded size. These two organizations can be combined in a hierarchical memory structure, which supports linearizability very efficiently; this hybrid structure allows processes to access local and remote information in a transparent manner, while at a lower level of the memory consistency system, different portions of the memory, allocated a priori according to anticipated remote versus local use of the objects, employ the suitable full caching or single ownership implementation. %R TR-06-92 %D 1992 %T Lecture Notes on Domain Theory %A Robert Muller %X This report contains a collection of lecture notes for a series of lectures introducing Scott's {\em domain theory}. The basic structure of semantic domains and fixpoint theory are introduced. The lecture notes are not intended to serve as a primary reference but rather as a supplement to a more comprehensive treatment. %R TR-07-92 %D 1992 %T Index Transformation Algorithms in a Linear\\ Algebra Framework %A Alan Edelman %A Steve Heller %A S. Lennart Johnsson %X We present a linear algebraic formulation for a class of index transformations such as Gray code encoding and decoding, matrix transpose, bit reversal, vector reversal, shuffles, and other index or dimension permutations. This formulation unifies, simplifies, and can be used to derive algorithms for hypercube multiprocessors. We show how all the widely known properties of Gray codes, and some not so well-known properties as well, can be derived using this framework. Using this framework, we relate hypercube communications algorithms to Gauss-Jordan elimination on a matrix of 0's and 1's. %X tr-07-92.ps.gz %R TR-08-92 %D 1992 %T An Alternative Conception of Tree-Adjoining Derivation %A Yves Schabes %A Stuart M.
Shieber %X The precise formulation of derivation for tree-adjoining grammars has important ramifications for a wide variety of uses of the formalism, from syntactic analysis to semantic interpretation and statistical language modeling. We argue that the definition of tree-adjoining derivation must be reformulated in order to manifest the proper linguistic dependencies in derivations. The particular proposal is both precisely characterizable, through a compilation to linear indexed grammars, and computationally operational, by virtue of an efficient algorithm for recognition and parsing. %X tr-08-92.ps.gz %R TR-09-92 %D 1992 %T Local Basic Linear Algebra Subroutines (LBLAS) for\\ Distributed Memory Architectures and Languages with\\ Array Syntax %A S. Lennart Johnsson %A Luis F. Ortiz %X We describe a subset of the level-1, level-2, and level-3 BLAS implemented for each node of the Connection Machine system CM-200 and with a set of interfaces consistent with Fortran 90. The implementation performs computations on multiple instances in a single call to a routine. The strides for the different axes are derived from an array descriptor that contains information about the length of the axes, the number of instances and their allocation in the machine. Another novel feature of our implementation of the BLAS in each node is a selection of loop order for rank-1 updates and matrix-matrix multiplication based upon array shapes, strides, and DRAM page faults. The peak efficiencies for the routines are in the range 75\% to 90\%. The optimization of loop ordering has a success rate exceeding 99.8\% for matrices for which the sum of the length of the axes is at most 60. The success rate is even higher for all possible matrix shapes. The performance loss when a nonoptimal choice is made is less than $\sim$15\% of peak, and typically less than 1\% of peak. We also show that the performance gain for high rank updates may be as much as a factor of 6 over rank-1 updates.
%X tr-09-92.ps.gz %R TR-10-92 %D 1992 %T Direct Bulk-Synchronous Parallel Algorithms %A Alexandros V. Gerbessiotis %A Leslie G. Valiant %X We describe a methodology for constructing parallel algorithms that are transportable among parallel computers having different numbers of processors, different bandwidths of interprocessor communication and different periodicity of global synchronization. We do this for the bulk-synchronous parallel (BSP) model, which abstracts the characteristics of a parallel machine into three numerical parameters $p$, $g$, and $L$, corresponding to processors, bandwidth, and periodicity, respectively. The model differentiates memory that is local to a processor from that which is not, but, for the sake of universality, does not differentiate network proximity. The advantages of this model in supporting shared memory or PRAM style programming have been treated elsewhere. Here we emphasize the viability of an alternative direct style of programming where, for the sake of efficiency, the programmer retains control of memory allocation. We show that optimality to within a multiplicative factor close to one can be achieved for the problems of Gauss-Jordan elimination and sorting, by transportable algorithms that can be applied for a wide range of values of the parameters $p$, $g$, and $L$. We also give some simulation results for PRAMs on the BSP to identify the level of slack at which corresponding efficiencies can be approached by shared memory simulations, provided the bandwidth parameter $g$ is good enough. %X tr-10-92.ps.gz %R TR-11-92 %D 1992 %T Deriving Parallel and Systolic Programs from Data Dependence %A Lilei Chen %X We present an algorithm that statically sequences data computations and communications for parallel and systolic executions. Instead of searching for implicit parallelism in a functional or sequential program, the algorithm looks for sequence requirements imposed by the data dependence and by the communication delays.
It achieves its efficiency by analyzing the sequence constraints. The actual sequence of the parallel computations can be decided at the last stage to tailor it to the specific parallel machine. In addition, once the processor mapping is decided, the data communication delay is combined with the computation sequence to obtain the final scheduling of the computations and communications. As a result, the parallel implementation fully explores the parallelism in the original program and effectively schedules the computations and minimizes the communication cost by systolic design. Because the algorithm is only based on the data dependence from the original program, it can be applied to a wide variety of program forms, from sequential loop programs with updates, to recursive equation sets. It can detect parallelism in sequential programs, as well as provide efficient implementations for recurrence statements in equation sets. %R TR-12-92 %D 1992 %T Algebraic Optimization of Outerjoin Queries %A C\'{e}sar Alejandro Galindo-Legaria %X The purpose of this thesis is to extend database optimization techniques for joins to queries that contain both joins and outerjoins. The benefits of query optimization are thus extended to a number of important applications, such as federated databases, nested queries, and hierarchical views, for which outerjoin is a key component. Our analysis of join/outerjoin queries is done in two parts. First, we investigate the interaction of outerjoin with other relational operators, to find simplification rules and associativity identities. Our approach is comprehensive and includes, as special cases, some outerjoin optimization heuristics that have appeared in the literature. Second, we abstract the notion of feasible evaluation order for binary, join-like operators, considering associativity rules but not specific operator semantics.
Combining these two parts, we show that a join/outerjoin query can be evaluated by combining relations in any given order ---just as it is done on join queries, except that now we need to synthesize an operator to use at each step, rather than always using join. The purpose of changing the order of processing of relations is to reduce the size of intermediate results. Our theoretical results are converted into algorithms compatible with the architecture of conventional database optimizers. For optimizers that repeatedly transform feasible strategies, the outerjoin identities we have identified can be applied directly. Those identities are sufficient to obtain all possible orders of processing. For optimizers that generate join programs bottom-up, we give a rule to determine the operators to use at each step. %X tr-12-92.ps.gz %R TR-13-92 %D 1992 %T Timing-Based, Distributed Computation:\\ Algorithms and Impossibility Results %A Marios Mavronicolas %X Real distributed systems are subject to timing uncertainties: processes may lack a common notion of real time, or may even have only inexact information about the amount of real time needed for performing primitive computation steps. In this thesis, we embark on a study of the complexity theory of such systems and present combinatorial results that determine the inherent costs of some accomplishable tasks. We first consider {\em continuous-time} models, where processes obtain timing information from continuous-time clocks that run at the same rate as real time, but might not be initially synchronized. Due to an uncertainty in message delay time, absolute process synchronization is known to be impossible for such systems. We develop novel synchronization schemes for such systems and use them for building a distributed, {\em full caching} implementation of shared memory that supports {\em linearizability}.
This implementation improves in efficiency over previous ones that support consistency conditions even weaker than linearizability and supports a quantitative degradation of the less frequently occurring operation. We present lower bound results which show that our implementation achieves efficiency close to optimal. We next turn to {\em discrete-time} models, where the time between any two consecutive steps of a process is in the interval $[c,1]$, for some constant $c$ such that $0 \leq c \leq 1$. We show time separation results between such {\em asynchronous} and {\em semi-synchronous} models, defined by taking $c=0$ and $c > 0$, respectively. Specifically, we use the {\em session problem} to show that the semi-synchronous model, for which the timing uncertainty, $\frac{1}{c}$, is bounded, is strictly more powerful than the asynchronous one under either message-passing or shared-memory interprocess communication. We also present tight lower and upper bounds on the degree of {\em precision} that can be achieved in the semi-synchronous model. Our combinatorial results shed some light on the capabilities and limitations of distributed systems subject to timing uncertainties. In particular, the main argument of this thesis is that the goal of designing distributed algorithms so that their logical correctness is timing-independent, whereas their performance might depend on timing assumptions, will not always be achievable: for some tasks, the only practical solutions might be strongly timing-dependent. %R TR-14-92 %D 1992 %T An Upper and a Lower Bound for Tick Synchronization %A Marios Mavronicolas %X The {\em tick synchronization problem} is defined and studied in the {\em semi-synchronous} network, where $n$ processes are located at the nodes of a complete graph and communicate by sending messages along links that correspond to its edges.
An algorithm for the tick synchronization problem brings each process into a synchronized state in which the process makes an estimate of real time that is close enough to those of other processes already in a synchronized state. It is assumed that the (real) time for message delivery is at most $d$ and the time between any two consecutive steps of any process is in the interval $[c,1]$ for some $c$ such that $0 < c \leq 1$. We define the {\em precision} of a tick synchronization algorithm to be the maximum difference between estimates of real time made by different processes in a synchronized state, and propose it as a worst-case performance measure. We show that no such algorithm can guarantee precision less than $\lfloor \frac{d-2}{2c} \rfloor$. We also present an algorithm which achieves a precision of $\frac{2(n-1)}{n}(\lceil \frac{2d}{c} \rceil + \frac{d}{2}) + \frac{1-c}{c}d + 1$. %R TR-15-92 %D 1992 %T The Complexity of Learning Formulas and Decision Trees that have Restricted Reads %A Thomas Raysor Hancock %X Many learning problems can be phrased in terms of finding a close approximation to some unknown target formula $f$, based on observing $f$'s value on a sample of points either drawn at random according to some underlying distribution, or perhaps selected by a learner for algorithmic reasons. In this research our goal is to prove theorems about what classes of formulas permit such learning in polynomial time (using the definitions of either Valiant's PAC model or Angluin's exact identification model). In particular we take powerful classes of formulas whose learnability is unknown or provably intractable, and then consider restricted cases where the number of different times a single variable may appear in the formula is limited to a small constant.
We prove positive learnability results in several such cases, given either added assumptions on the underlying distribution of random points or the ability of the learner to select some of the sample points. We provide polynomial time learning algorithms for decision trees and monotone disjunctive normal form (DNF) formulas when variables appear at most some arbitrary constant number of times, given that the sample points are chosen uniformly. Over arbitrary distributions, we show algorithms that choose their own sample points, besides using random examples, to closely approximate the same class of decision trees and the class of DNF formulas where variables appear at most twice. For arbitrary formulas, we give a number of algorithms for the read-once case (where variables appear only once) over different bases (the functions computed at the formula's nodes). Besides identification algorithms for large classes of Boolean read-once formulas, these results include new interpolation algorithms for classes of rational functions, and a membership query algorithm for a new class of neural networks. %R TR-16-92 %D 1992 %T Optimal Communication Channel Utilization for \\ Matrix Transposition and Related Permutations\\ on Binary Cubes %A S. Lennart Johnsson %A Ching-Tien Ho %X We present optimal schedules for permutations in which each node sends one or several unique messages to every other node. With concurrent communication on all channels of every node in binary cube networks, the number of element transfers in sequence for $K$ elements per node is ${K \over 2}$, irrespective of the number of nodes over which the data set is distributed. For a succession of $s$ permutations within disjoint subcubes of $d$ dimensions each, our schedules yield $\min({K \over 2} + (s-1)d, (s+3)d, {K \over 2} + 2d)$ exchanges in sequence.
The algorithms can be organized to avoid indirect addressing in the internode data exchanges, a property that increases the performance on some architectures. For message passing communication libraries, we present a blocking procedure that minimizes the number of block transfers while preserving the utilization of the communication channels. For schedules with optimal channel utilization, the number of block transfers for a binary $d$-cube is $d$. The maximum block size for $K$ elements per node is $\lceil {K \over 2d} \rceil$. %X tr-16-92.ps.gz %R TR-17-92 %D 1992 %T Parallel Sets: An Object-Oriented Methodology for Massively Parallel Programming %A Michael Francis Kilian %X Parallel programming has become the focus of much research in the past decade. As the limits of VLSI technology are tested, it becomes more apparent that parallel processors will be responsible for the next quantum leap in performance. Already parallel programming is responsible for significant advances not so much in the speed of solving problems, but in the size of problems that can be solved. Carefully crafted parallel programs are solving problems orders of magnitude larger than could be considered for serial machines. Object-oriented programming has also become popular in academia and perhaps even more so in industry. O-O holds out the promise of being able to efficiently build large systems that are understandable, maintainable, and more robust. The programs targeted by O-O are different than those typically found running on a computer such as the Connection Machine. Parallel programs are often designed for very specific tasks; O-O programs' strengths are that they handle a wide variety of requirements. The thesis proposed here is that an object-oriented model of programming can be developed that is suitable for massively parallel processors. A set of criteria is developed for object-oriented parallel programming models and existing models are evaluated using these criteria.
Given these criteria, the thesis presents a new way of thinking of parallel programs that builds upon an object-oriented foundation. A new basic type, called Parallel-Set, is added to the object model. Parallel sets are rigorously defined and then used to express complex communication between objects. The communication model is then extended to allow communication and synchronization protocols to be developed. The contribution of this work is that a wider range of reliable programs can be designed for use on parallel computers and that these programs will be easier to construct and understand. %R TR-18-92 %D 1992 %T Language and Compiler Issues in Scalable High\\ Performance Scientific Libraries %A S. Lennart Johnsson %X Library functions for scalable architectures must be designed to correctly and efficiently support any distributed data structure that can be created with the supported languages and associated compiler directives. Libraries must be designed also to support concurrency in each function evaluation, as well as the concurrent application of the functions to disjoint array segments, known as {\em multiple-instance} computation. Control over the data distribution is often critical for locality of reference, and so is the control over the interprocessor data motion. Scalability, while preserving efficiency, implies that the data distribution, the data motion, and the scheduling are adapted to the object shapes, the machine configuration, and the size of the objects relative to the machine size. The Connection Machine Scientific Software Library is a scalable library for distributed data structures. The library is designed for languages with an array syntax. It is accessible from all supported languages ($\ast$Lisp, C$\ast$, CM-Fortran, and Paris (PARallel Instruction Set) in combination with Lisp, C, and Fortran 77).
Single library calls can manage both concurrent application of a function to disjoint array segments and concurrency in each application of a function. The control of the concurrency is independent of the control constructs provided in the high-level languages. Library functions operate efficiently on any distributed data structure that can be defined in the high-level languages and associated directives. Routines may use their own internal data distribution for efficiency reasons. The algorithm invoked by a call to a library function depends upon the shapes of the objects involved, their sizes and distribution, and upon the machine shape and size. %X tr-18-92.ps.gz %R TR-19-92 %D 1992 %T Lessons from a Restricted Turing Test %A Stuart M. Shieber %X We report on the recent Loebner prize competition inspired by Turing's test of intelligent behavior. The presentation covers the structure of the competition and the outcome of its first instantiation in an actual event, and an analysis of the purpose, design, and appropriateness of such a competition. We argue that the competition has no clear purpose, that its design prevents any useful outcome, and that such a competition is inappropriate given the current level of technology. We then speculate as to suitable alternatives to the Loebner prize. %X tr-19-92.ps.gz %R TR-20-92 %D 1992 %T An Efficient Algorithm for Gray-to-Binary Permutation\\ on Hypercubes %A Ching-Tien Ho %A M.T. Raghunath %A S. Lennart Johnsson %X Both Gray code and binary code are frequently used in mapping arrays into hypercube architectures. While the former is preferred when communication between adjacent array elements is needed, the latter is preferred for FFT-type communication. When different phases of computations have different types of communication patterns, the need arises to remap the data.
We give a nearly optimal algorithm for permuting data from a Gray code mapping to a binary code mapping on a hypercube with communication restricted to one input and one output channel per node at a time. Our algorithm improves over the best previously known algorithm [6] by nearly a factor of two and is optimal to within a factor of $n/(n-1)$ with respect to data transfer time on an $n$-cube. The expected speedup is confirmed by measurements on an Intel iPSC/2 hypercube. %X tr-20-92.ps.gz %R TR-21-92 %D 1992 %T Physically Realistic Trajectory Planning in Animation:\\ A Stimulus-Response Approach %A J. Thomas Ngo %A Joe Marks %X Trajectory-planning problems arise in animation when objects must move subject to physical law and other constraints on their motion. Witkin and Kass dubbed this class of problems ``Spacetime Constraints'' (SC) and presented results for specific problems involving an articulated figure. SC problems are typically multimodal and discontinuous, and the number of decision alternatives available at each time step can make constructing even coarse trajectories for subsequent optimization difficult without directive input from the user. Rather than use a time-domain representation, which might be appropriate for local optimization, our algorithm uses a stimulus-response model. Locomotive skills are found by a search procedure which chooses stimulus-response parameters using a parallel genetic algorithm. The procedure succeeds in finding good, novel solutions for a test suite of SC problems involving unbranched articulated figures. %R TR-22-92 %D 1992 %T Pronouns, Names, and the Centering of Attention in Discourse %A Peter C. Gordon %A Barbara J. Grosz %A Laura A. Gilliom %X Centering theory, developed within computational linguistics, provides an account of ways in which patterns of inter-utterance reference can promote the local coherence of discourse.
It states that each utterance in a coherent discourse segment contains a single semantic entity -- the backward-looking center -- that provides a link to the previous utterance, and an ordered set of entities -- the forward-looking centers -- that offer potential links to the next utterance. We report five reading-time experiments that test predictions of this theory with respect to the conditions under which it is preferable to realize (refer to) an entity using a pronoun rather than a repeated definite description or name. The experiments show that there is a single backward-looking center that is preferentially realized as a pronoun, and that the backward-looking center is typically realized as the grammatical subject of the utterance. They also provide evidence that there is a set of forward-looking centers that is ranked in terms of prominence and that a key factor in determining prominence, surface-initial position, does not affect determination of the backward-looking center. This provides evidence for the dissociation of the coherence processes of looking backward and looking forward. %R TR-23-92 %D 1992 %T Communication Primitives for Unstructured Finite Element Simulations on Data Parallel Architectures %A Kapil K. Mathur %A S. Lennart Johnsson %X Efficient data motion is critical for high performance computing on distributed memory architectures. The value of some techniques for efficient data motion is illustrated by identifying generic communication primitives. Further, the efficiency of these primitives is demonstrated on three different applications using the finite element method for unstructured grids and sparse solvers with different communication requirements. For the applications presented, the techniques advocated reduced the communication times by a factor of between 1.5 and 3. %X tr-23-92.ps.gz %R TR-24-92 %D 1992 %T A Combining Mechanism for Parallel Computers %A Leslie G.
Valiant %X In a multiprocessor computer, communication among the components may be based either on a simple router, which delivers messages point-to-point like a mail service, or on a more elaborate combining network that, in return for a greater investment in hardware, can combine messages to the same address prior to delivery. This paper describes a mechanism for recirculating messages in a simple router so that the added functionality of a combining network, for arbitrary access patterns, can be achieved by it with reasonable efficiency. The method brings together the messages with the same destination address in more than one stage, and at a set of components that is determined by a hash function and decreases in number at each stage. %X tr-24-92.ps.gz %R TR-25-92 %D 1992 %T Labeling Point Features on Maps and Diagrams %A Jon Christensen %A Joe Marks %A Stuart Shieber %X (Revised 6/94; includes color figures.) A major factor affecting the clarity of graphical displays that include text labels is the degree to which labels obscure display features (including other labels) as a result of spatial overlap. Point-feature label placement (PFLP) is the problem of placing text labels adjacent to point features on a map or diagram so as to maximize legibility. This problem occurs frequently in the production of many types of informational graphics, though it arises most often in automated cartography. In this paper we present a comprehensive treatment of the PFLP problem, viewed as a type of combinatorial optimization problem. Complexity analysis reveals that the basic PFLP problem and most interesting variants of it are NP-hard. These negative results help inform a survey of previously reported algorithms for PFLP; not surprisingly, all such algorithms either have exponential time complexity or are incomplete. To solve the PFLP problem in practice, then, we must rely on good heuristic methods.
We propose two new methods, one based on a discrete form of gradient descent, the other on simulated annealing, and report on a series of empirical tests comparing these and the other known algorithms for the problem. Based on this study, the first such study to be conducted, we identify the best approaches as a function of available computation time. %X tr-25-92.ps.gz %R TR-26-92 %D 1992 %T Why BSP Computers? %A L.G. Valiant %R TR-27-92 %D 1992 %T Variations on Incremental Interpretation %A Stuart M. Shieber %A Mark Johnson %X tr-27-92.ps.gz %R TR-28-92 %D 1992 %T A Fixpoint Theory of Nonmonotonic Functions and \\ Its Applications to Logic Programs, Deductive Databases\\ and Production Rule Systems %A Yuli Zhou %X In this thesis we shall employ denotational (fixpoint) methods to study the computations of rule systems based on first order logic. The resulting theory parallels and further strengthens the fixpoint theory of {\it stratified logic programs} developed by Apt, Blair and Walker, and we shall consider two principal applications of the theory to logic programs and to production rule systems. %R TR-29-92 %D 1992 %T Massively Parallel Computing: Data distribution and\\ communication %A S. Lennart Johnsson %X We discuss some techniques for preserving locality of reference in index spaces when mapped to memory units in a distributed memory architecture. In particular, we discuss the use of multidimensional address spaces instead of linearized address spaces, partitioning of irregular grids, and placement of partitions among nodes. We also discuss a set of communication primitives we have found very useful on the Connection Machine systems in implementing scientific and engineering applications. We briefly review some of the techniques used to fully utilize the bandwidth of the binary cube network of the CM-2 and CM-200, and give some performance data from implementations of communication primitives.
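The locality argument of TR-29-92 can be made concrete with a small sketch. Nothing below is from the report; the grid size, node count, and ownership maps are arbitrary choices, and the number of partition-crossing grid edges is used as a standard proxy for nearest-neighbor communication volume:

```python
# Sketch (not from TR-29-92): communication volume of a 2D grid
# stencil under two data distributions across 16 nodes. A
# nearest-neighbor update needs one remote value per grid edge
# whose endpoints live on different nodes, so we count such edges.

def cut_edges(n, owner):
    """Count edges of an n x n grid whose endpoints lie on
    different nodes under the ownership map owner(i, j)."""
    cut = 0
    for i in range(n):
        for j in range(n):
            if i + 1 < n and owner(i, j) != owner(i + 1, j):
                cut += 1
            if j + 1 < n and owner(i, j) != owner(i, j + 1):
                cut += 1
    return cut

n, p = 64, 16           # 64 x 64 grid, 16 nodes

# Linearized address space: contiguous row strips, 4 rows per node.
strips = cut_edges(n, lambda i, j: i // (n // p))

# Multidimensional address space: a 4 x 4 grid of 16 x 16 blocks.
blocks = cut_edges(n, lambda i, j: (i // 16) * 4 + j // 16)

print(strips, blocks)   # the 2D blocking cuts far fewer edges
```

The strip layout cuts 15 row boundaries of 64 edges each, while the blocked layout cuts only 3 boundaries of 64 edges in each dimension, which is the locality benefit the abstract attributes to multidimensional address spaces.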
%X tr-29-92.ps.gz %R TR-30-92 %D 1992 %T Optimal Computing on Mesh-Connected Processor Arrays %A Christos Ioannis Kaklamanis %X In this thesis, we present and analyze new algorithms for routing, sorting and dynamic searching on mesh-connected arrays of processors; we also present a lower bound concerning embeddings on faulty arrays. In particular, we first consider the problem of permutation routing in two- and three-dimensional mesh-connected processor arrays. We present new on-line and off-line routing algorithms, all of which are optimal to within a small additive term. Then, we show that sorting an input of size $N = n^2$ can be performed by an $n \times n$ mesh-connected processor array in $2n + o(n)$ parallel communication steps and using constant-size queues, with high probability. This result is optimal to within a low order additive term, realizing the obvious diameter lower bound. Our techniques can be applied to higher-dimensional meshes as well as torus-connected networks, achieving significantly better bounds than the known results. Furthermore, we investigate the parallel complexity of backtrack and branch-and-bound search on the mesh-connected array. We present an $\Omega(\sqrt{dN}/\sqrt{\log N})$ lower bound for the time needed by a {\em randomized} algorithm to perform backtrack and branch-and-bound search of a tree of depth $d$ on the $\sqrt{N} \times \sqrt{N}$ mesh, even when the depth of the tree is known in advance. For the upper bounds we give {\em deterministic} algorithms that are within a factor of $O(\log^{3/2} N)$ of our lower bound. Our algorithms do not make any assumption on the shape of the tree to be searched. Our algorithm for branch-and-bound is the first algorithm that performs branch-and-bound search on a sparse network. Both the lower and the upper bounds extend to higher-dimensional meshes.
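The mesh-sorting model of TR-30-92 can be illustrated with shearsort, a classic deterministic algorithm that sorts an $n \times n$ mesh in $\Theta(n \log n)$ row/column phases. It is not the thesis's randomized $2n + o(n)$ algorithm, only a simple baseline in the same model, where each phase is a local sort of every row or every column:

```python
# Sketch: shearsort, a simple deterministic baseline for mesh
# sorting (ceil(log2 n) + 1 phases of row sort + column sort).
# NOT the thesis's algorithm; it only illustrates the model.
import math
import random

def shearsort(grid):
    n = len(grid)
    for _ in range(math.ceil(math.log2(n)) + 1):
        for i, row in enumerate(grid):       # row phase: snake order
            row.sort(reverse=(i % 2 == 1))
        for j in range(n):                   # column phase: top-down
            col = sorted(grid[i][j] for i in range(n))
            for i in range(n):
                grid[i][j] = col[i]
    return grid

n = 8
grid = [[random.randrange(100) for _ in range(n)] for _ in range(n)]
shearsort(grid)

# Read the mesh in snake (boustrophedon) order; it must be sorted.
snake = [x for i, row in enumerate(grid)
         for x in (row if i % 2 == 0 else reversed(row))]
assert snake == sorted(snake)
```

Each row or column sort costs $O(n)$ communication steps on the mesh, so shearsort needs $O(n \log n)$ steps in total; the gap down to the diameter bound of roughly $2n$ is what the thesis's randomized algorithm closes.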
%R TR-31-92 %D 1992 %T An Algebraic Approach to the Compilation and Operational\\ Semantics of Functional Languages with I-structures %A Zena Matilde Ariola %X Modern languages are too complex to be given direct operational semantics. For example, the operational semantics of functional languages has traditionally been given by translating them to the $\lambda$-calculus extended with constants. Compilers do a similar translation into an intermediate form in the process of generating code for a machine. A compiler then performs optimizations on this intermediate form before generating machine code. In this thesis we show that the intermediate form can actually be the kernel language. In fact, we may translate the kernel language into still lower-level language(s), where more machine-oriented or efficiency-related concerns can be expressed directly. Furthermore, compiler optimizations may be expressed as source-to-source transformations on the intermediate languages. We introduce two implicitly parallel languages, Kid (Kernel Id) and P-TAC (Parallel Three Address Code), and describe the compilation process of Id in terms of a translation of Id into Kid, and of Kid into P-TAC\@. In this thesis we do not describe the compilation process below the P-TAC level. However, we show that our compilation process allows the formalization of questions related to the correctness of the optimizations. We also give the operational semantics of Id indirectly by its translation into Kid and a well-defined operational semantics for Kid. Kid and P-TAC are examples of Graph Rewriting Systems (GRSs), which are introduced to capture sharing of computation precisely. Sharing of subexpressions is important both semantically (e.g., to model side-effects) and pragmatically (e.g., to reason about complexity). Our GRSs extend Barendregt's Term Graph Rewriting Systems to include cyclic graphs and cyclic rules.
We present a term model for GRSs along the lines of L\'evy's term model for $\lambda$-calculus, and show its application to compiler optimizations. We also show that GRS reduction is a correct implementation of term rewriting. %R TR-01-93 %D 1993 %T Massively Parallel Computing: \\ Mathematics and communications libraries %A S. Lennart Johnsson %A Kapil K. Mathur %X Massively parallel computing holds the promise of extreme performance. The utility of these systems will depend heavily upon the availability of libraries until compilation and run-time system technology is developed to a level comparable to what today is common on most uniprocessor systems. Critical for performance is the ability to exploit locality of reference and effective management of the communication resources. We discuss some techniques for preserving locality of reference in distributed memory architectures. In particular, we discuss the benefits of multidimensional address spaces instead of the conventional linearized address spaces, partitioning of irregular grids, and placement of partitions among nodes. Some of these techniques are supported as language directives, others as run-time system functions, and others still are part of the Connection Machine Scientific Software Library, CMSSL. We briefly discuss some of the unique design issues in this library for distributed memory architectures, and some of the novel ideas with respect to managing data allocation, and automatic selection of algorithms with respect to performance. The CMSSL also includes a set of communication primitives we have found very useful on the Connection Machine systems in implementing scientific and engineering applications. We briefly review some of the techniques used to fully utilize the bandwidth of the binary cube network of the CM-2 and CM-200 Connection Machine systems. %X tr-01-93.ps.gz %R TR-02-93 %D 1993 %T All-to-All Communication on the Connection Machine CM-200 %A Kapil K. Mathur %A S. 
Lennart Johnsson %X Detailed algorithms for all-to-all broadcast and reduction are given for arrays mapped by binary or binary-reflected Gray code encoding to the processing nodes of binary cube networks. Algorithms are also given for the local computation of the array indices for the communicated data, thereby reducing the demand for communications bandwidth. For the Connection Machine system CM-200, Hamiltonian cycle based all-to-all communication algorithms yield performance that is a factor of two to ten higher than the performance offered by algorithms based on trees, butterfly networks, or the Connection Machine router. The peak data rate achieved for all-to-all broadcast on a 2048 node Connection Machine system CM-200 is 5.4 Gbytes/sec when no reordering is required. If the time for data reordering is included, then the effective peak data rate is reduced to 2.5 Gbytes/sec. %X tr-02-93.ps.gz %R TR-03-93 %D 1993 %T Topics in Parallel and Distributed Computation %A Alexandros Gerbessiotis %X With advances in communication technology, the introduction of multiple-instruction multiple-data parallel computers and the increasing interest in neural networks, the fields of parallel and distributed computation have received increasing attention in recent years. We study in this work the bulk-synchronous parallel model, which attempts to bridge the software and hardware worlds with respect to parallel computing. It offers a high level abstraction of the hardware with the purpose of allowing parallel programs to run efficiently on diverse hardware platforms. We examine direct algorithms on this model and also give simulations of other models of parallel computation on this one as well as on models that bypass it. While the term parallel computation refers to the execution of a single or a set of closely coupled tasks by a set of processors, the term distributed computation refers to more loosely coupled or uncoupled tasks being executed at different locations.
In a distributed computing environment it is sometimes necessary that one computer send to the remaining ones various pieces of information. The term broadcasting is used to describe the dissemination of information from one computer to the others in such an environment. We examine various classes of random graphs with respect to broadcasting and establish results related to the minimum time required to perform broadcasting from any vertex of such graphs. Various models of the human brain, as a collection of distributed elements working in parallel, have been proposed. Such elements are connected together in a network. The network in the human brain is sparse. How such sparse networks of simple elements can perform any useful computation is a topic currently little understood. We examine in this work a graph construction problem as it relates to neuron allocation in a model of neural networks recently proposed. We also examine a certain class of random graphs with respect to this problem and establish various results related to the distribution of the sizes of sets of neurons when various learning tasks are performed on this model. Experimental results are also presented and compared to theoretically derived ones. %R TR-04-93 %D 1993 %T Infrastructure for Research towards Ubiquitous Information\\ Systems %A Barbara Grosz %A H.T. Kung %A Margo Seltzer %A Stuart Shieber %A Michael Smith %R TR-05-93 %D 1993 %T An Equational Framework for the Flow Analysis of Higher Order Functional Programs %A Dan Stefanescu %A Yuli Zhou %X This paper presents a novel technique for the static analysis of functional programs. The method uses the Cousots' original framework, expanded with a syntax-based abstraction methodology. The main idea is to represent each computational entity in a functional program in relation to its (concrete) call string, i.e. the string of function calls leading to its computation.
Furthermore, the abstraction criterion consists in choosing a relation of equivalence over the set of all call strings. Based on this relation of equivalence, the method generates a monotonic system of equations such that its least solution is the desired result of the analysis. This approach generalizes previous techniques (0CFA, 1CFA, etc.) in flow analysis and allows for program-directed design of frameworks for approximate analysis of programs. The method is proven correct with respect to a rewriting-system-based operational semantics. %R TR-06-93 %D 1993 %T An Efficient Communication Strategy for Finite Element\\ Methods on the Connection Machine CM-5 System %A Zden\v{e}k Johan %A Kapil K. Mathur %A S. Lennart Johnsson %A Thomas J.R. Hughes %X Performance of finite element solvers on parallel computers such as the Connection Machine CM-5 system is directly related to the efficiency of the communication strategy. The objective of this work is two-fold: First, we propose a data-parallel implementation of a partitioning algorithm used to decompose unstructured meshes. The mesh partitions are then mapped to the vector units of the CM-5. Second, we design gather and scatter operations taking advantage of data locality coming from the decomposition to reduce the communication time. This new communication strategy is available in the CMSSL [8]. An example illustrates the performance of the proposed strategy. %X tr-06-93.ps.gz %R TR-07-93 %D 1993 %T All-to-All Communication Algorithms for Distributed BLAS %A Kapil K. Mathur %A S. Lennart Johnsson %X Dense Distributed Basic Linear Algebra Subroutine (DBLAS) algorithms based on all-to-all broadcast and all-to-all reduce are presented. For DBLAS, at each all-to-all step, it is necessary to know the data values and the indices of the data values as well.
This is in contrast to the more traditional applications of all-to-all broadcast (such as an $N$-body solver) where the identity of the data values is not of much interest. Detailed schedules for all-to-all broadcast and reduction are given for the data motion of arrays mapped to the processing nodes of binary cube networks using binary encoding and binary-reflected Gray encoding. The algorithms compute the indices for the communicated data locally. No communication bandwidth is consumed for data array indices. For the Connection Machine system CM-200, Hamiltonian cycle based all-to-all communication algorithms improve the performance by a factor of two to ten over a combination of tree, butterfly network, and router based algorithms. The data rate achieved for all-to-all broadcast on a 256 node Connection Machine system CM-200 is 0.3 Gbytes/sec. The data motion rate for all-to-all broadcast, including the time for index computations and local data reordering, is about 2.8 Gbytes/sec for a 2048 node system. Excluding the time for index computation and local memory reordering the measured data motion rate for all-to-all broadcast is 5.6 Gbytes/s. On a Connection Machine system, CM-200, with 2048 processing nodes, the overall performance of the distributed matrix vector multiply (DGEMV) and vector matrix multiply (DGEMV with TRANS) is 10.5 Gflops/s and 13.7 Gflops/s respectively. %X tr-07-93.ps.gz %R TR-08-93 %D 1993 %T Massively Parallel Computing: \\ Unstructured Finite Element Simulations %A Kapil K. Mathur %A Zden\v{e}k Johan %A S. Lennart Johnsson %A Thomas J.R. Hughes %X Massively parallel computing holds the promise of extreme performance. Critical for achieving high performance is the ability to exploit locality of reference and effective management of the communication resources.
This article describes two communication primitives and associated mapping strategies that have been used for several different unstructured, three-dimensional, finite element applications in computational fluid dynamics and structural mechanics. %X tr-08-93.ps.gz %R TR-09-93 %D 1993 %T The Connection Machine System CM-5 %A S. Lennart Johnsson %X The Connection Machine system CM-5 is a parallel computing system scalable to Tflop performance, hundreds of Gbytes of primary storage, Tbytes of secondary storage and Gbytes/s of I/O bandwidth. The system has been designed to be scalable over a range of up to three orders of magnitude. We will discuss the design goals, innovative software and hardware features of the CM-5 system, and some experience with the system. %X tr-09-93.ps.gz %R TR-10-93 %D 1993 %T Constraint-Driven Diagram Layout %A Ed Dengler %A Mark Friedell %A Joe Marks %X Taking both perceptual organization and aesthetic criteria into account is the key to high-quality diagram layout, but makes for a more difficult problem than pure aesthetic layout. Computing the layout of a network diagram that exhibits a specified perceptual organization can be phrased as a constraint-satisfaction problem. Some constraints are derived from the perceptual-organization specification: the nodes in the diagram must be positioned so that they form specified perceptual gestalts, i.e., certain groups of nodes must form perceptual groupings by proximity, or symmetry, or shape motif, etc. Additional constraints are derived from aesthetic considerations: the layout should satisfy criteria that concern the number of link crossings, the sum of link lengths, or diagram area, etc. Using a generalization of a simple mass-spring layout technique to ``satisfice'' constraints, we show how to produce high-quality layouts with specified perceptual organization for medium-sized diagrams (10--30 nodes) in under 30 seconds on a workstation.
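The simple mass-spring technique that TR-10-93 generalizes can be sketched in a few lines. This is a generic force-directed iteration, not the paper's system: the spring and repulsion constants, step size, and test graph are arbitrary, and none of the perceptual-organization constraints are modeled:

```python
# Sketch: a bare-bones mass-spring layout step of the kind the
# TR-10-93 abstract builds on. Springs pull linked nodes toward a
# rest length; a weak pairwise repulsion keeps nodes apart. All
# constants here are arbitrary illustrative choices.
import math
import random

def layout(nodes, links, rest=1.0, steps=300, dt=0.05):
    random.seed(0)                               # reproducible demo
    pos = {v: [random.random(), random.random()] for v in nodes}
    for _ in range(steps):
        force = {v: [0.0, 0.0] for v in nodes}
        for u, v in links:                       # spring attraction
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            d = math.hypot(dx, dy) or 1e-9
            f = (d - rest) / d                   # Hooke-style pull
            force[u][0] += f * dx; force[u][1] += f * dy
            force[v][0] -= f * dx; force[v][1] -= f * dy
        for u in nodes:                          # pairwise repulsion
            for v in nodes:
                if u == v:
                    continue
                dx = pos[v][0] - pos[u][0]
                dy = pos[v][1] - pos[u][1]
                d2 = dx * dx + dy * dy or 1e-9
                force[u][0] -= 0.1 * dx / d2
                force[u][1] -= 0.1 * dy / d2
        for v in nodes:                          # gradient-style step
            pos[v][0] += dt * force[v][0]
            pos[v][1] += dt * force[v][1]
    return pos

# A 4-cycle relaxes toward a roughly square shape.
pos = layout("abcd", [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
```

The paper's contribution is what this sketch omits: additional constraint forces that push node groups into specified perceptual gestalts while "satisficing" the aesthetic criteria.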
%X tr-10-93.ps.gz %R TR-11-93 %D 1993 %T An Efficient Communication Strategy for Finite Element\\ Methods on the Connection Machine CM-5 System %A Zden\v{e}k Johan %A Kapil K. Mathur %A S. Lennart Johnsson %A Thomas J.R. Hughes %X The objective of this paper is to propose communication procedures suitable for unstructured finite element solvers implemented on distributed-memory parallel computers such as the Connection Machine CM-5 system. First, a data-parallel implementation of the recursive spectral bisection (RSB) algorithm proposed by Pothen {\em et al.} is presented. The RSB algorithm is associated with a node renumbering scheme which improves data locality of reference. Two-step gather and scatter operations taking advantage of this data locality are then designed. These communication primitives make use of the indirect addressing capability of the CM-5 vector units to achieve high gather and scatter bandwidths. The efficiency of the proposed communication strategy is illustrated on large-scale three-dimensional fluid dynamics problems. %X tr-11-93.ps.gz %R TR-12-93 %D 1993 %T Aligning Sentences in Bilingual Corpora Using Lexical\\ Information %A Stanley F. Chen %X In this paper, we describe a fast algorithm for aligning sentences with their translations in a bilingual corpus. Existing efficient algorithms ignore word identities and only consider sentence length (Brown {\it et al.}, 1991; Gale and Church, 1991). Our algorithm constructs a simple statistical word-to-word translation model on the fly during alignment. We find the alignment that maximizes the probability of generating the corpus with this translation model. We have achieved an error rate of approximately 0.4\% on Canadian Hansard data, which is a significant improvement over previous results. The algorithm is language independent.
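The length-only approach that TR-12-93 improves on can be sketched as a dynamic program over sentence lengths. This is a toy Gale/Church-style baseline, restricted to 1:1, 1:0, and 0:1 beads with an arbitrary skip cost and a crude absolute-difference score, not Chen's lexical algorithm, which replaces the length score with a word-to-word translation model:

```python
# Sketch: length-based sentence alignment by dynamic programming,
# the kind of baseline TR-12-93 contrasts with. Only 1:1, 1:0 and
# 0:1 beads are modeled; skip_cost is an arbitrary penalty.

def align(src_lens, tgt_lens, skip_cost=6.0):
    """Return 1:1 sentence pairs (i, j) minimizing total
    length-mismatch cost; unmatched sentences pay skip_cost."""
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:                  # 1:1 bead
                c = cost[i][j] + abs(src_lens[i] - tgt_lens[j])
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = c
                    back[i + 1][j + 1] = (i, j)
            for di, dj in ((1, 0), (0, 1)):      # 1:0 / 0:1 beads
                if i + di <= n and j + dj <= m:
                    c = cost[i][j] + skip_cost
                    if c < cost[i + di][j + dj]:
                        cost[i + di][j + dj] = c
                        back[i + di][j + dj] = (i, j)
    pairs, i, j = [], n, m                       # recover the beads
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if (i - pi, j - pj) == (1, 1):
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]

print(align([10, 4, 25], [11, 5, 24]))  # → [(0, 0), (1, 1), (2, 2)]
```

Because the score looks only at lengths, a short source sentence will happily align with an unrelated short target sentence; replacing the length term with a translation-model probability is precisely the refinement the abstract describes.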
%R TR-13-93 %D 1993 %T Compaction and Separation Algorithms for Non-Convex Polygons and Their Applications %A Zhenyu Li %A Victor Milenkovic %X Given a two-dimensional, non-overlapping layout of convex and non-convex polygons, compaction can be thought of as simulating the motion of the polygons as a result of applied ``forces.'' We apply compaction to improve the material utilization of an already tightly packed layout. Compaction can be modeled as a motion of the polygons that reduces the value of some functional on their positions. Optimal compaction, planning a motion that reaches a layout that has the global minimum functional value among all reachable layouts, is shown to be NP-complete under certain assumptions. We first present a compaction algorithm based on existing physical simulation approaches. This algorithm uses a new velocity-based optimization model. Our experimental results reveal the limitation of physical simulation: even though our new model improves the running time of our algorithm over previous simulation algorithms, the algorithm still cannot compact typical layouts of one hundred or more polygons in a reasonable amount of time. The essential difficulty of physically based models is that they can only generate velocities for the polygons, and the final positions must be generated by numerical integration. We present a new position-based optimization model that allows us to calculate new polygon positions directly, via linear programming, at a local minimum of the objective. The new model yields a translational compaction algorithm that runs two orders of magnitude faster than physical simulation methods. We also consider the problem of separating overlapping polygons using a minimal amount of motion and show it to be NP-complete. Although this separation problem looks quite different from the compaction problem, our new model also yields an efficient algorithm to solve it.
The compaction/separation algorithms have been applied to marker making: the task of packing polygonal pieces on a sheet of cloth of fixed width so that total length is minimized. The compaction algorithm has improved cloth utilization of human-generated pants markers. The separation algorithm together with a database of human-generated markers can be used for automatic generation of markers that approach human performance. %X tr-13-93.ps.gz %R TR-14-93 %D 1993 %T Universal Boolean Judges and Their Characterization %A Eyal Kushilevitz %A Silvio Micali %A Rafail Ostrovsky %X We consider the classic problem of $n$ honest (but curious) players with private inputs $x_1,\ldots, x_n$ who wish to compute the value of some pre-determined function $f(x_1,\ldots,x_n)$, so that at the end of the protocol every player knows the value of $f(x_1,\ldots,x_n)$. The players have unbounded computational resources and they wish to compute $f$ in a totally {\em private\/} ($n$-private) way. That is, after the completion of the protocol, which all players honestly follow, no coalition (of arbitrary size) can infer any information about the private inputs of the remaining players beyond what has already been revealed by the value of $f(x_1,\ldots,x_n)$. Of course, with the help of a {\em trusted judge for computing $f$}, players can trivially compute $f$ in a totally private manner: every player secretly gives his input to the trusted judge and she announces the result. Previous research was directed towards implementing such a judge ``mentally'' by the players themselves, and was shown possible under various assumptions. Without assumptions, however, it was shown that most functions {\em cannot\/} be computed in a totally private manner and thus we must rely on a trusted judge. If we have a trusted judge for $f$ we are done. Can we use a judge for a ``simpler'' function $g$ in order to compute $f$ $n$-privately?
In this paper we initiate the study of the {\em complexity} of such judges needed to achieve total privacy for arbitrary $f$. We answer the following two questions: {\em How complicated must such a judge be, compared to $f$?} and {\em Does there exist some judge which can be used for all $f$?} We show that there exist {\bf universal boolean} judges (i.e. ones that can be used for any $f$) and give a complete characterization of all the boolean functions which describe universal judges. In fact, we show that a judge computing {\em any\/} boolean function $g$ which itself cannot be computed $n$-privately (i.e., when there is no judge available) is {\em universal\/}. Thus, we show that for all boolean functions, the notions of {\bf universality\/} and {\bf $n$-privacy} are {\em complementary\/}. On the other hand, for non-boolean functions, we show that these two notions are {\em not\/} complementary. Our result can be viewed as a strong generalization of the two-party case, where Oblivious Transfer protocols were shown to be universal. %R TR-15-93 %D 1993 %T An Unbundled Compiler %A Thomas Cheatham %R TR-16-93 %D 1993 %T Actions, Beliefs and Intentions in Multi-Action Utterances %A Cecile Tiberghien Balkanski %X Multi-action utterances convey critical information about agents' beliefs and intentions with respect to the actions they talk about or perform. Two such utterances may, for example, describe the same actions while the speakers of these utterances hold beliefs about these actions that are diametrically opposed. Hence, for a language interpretation system to understand multi-action utterances, it must be able (1) to determine the actions that are described and the ways in which they are related, and (2) to draw appropriate inferences about the agents' mental states with respect to these actions and action relations.
This thesis investigates the semantics of two particular multi-action constructions: utterances with means clauses and utterances with rationale clauses. These classes of utterances are of interest not only as exemplars of multi-action utterances, but also because of the subtle differences in information that can be felicitously inferred from their use. Their meaning is shown to depend on the beliefs and intentions of the speaker and agents whose actions are being described as well as on the actions themselves. Thus, the thesis demonstrates (a) that consideration of mental states cannot be reserved to pragmatics and (b) that other aspects of natural language interpretation besides the interpretation of mental state verbs or plan recognition may provide information about mental states. To account for this aspect of natural language interpretation, this thesis presents a theory of logical form, a theory of action and action relations, an axiomatization of belief and intention, and interpretation rules for means clauses and rationale clauses. Together these different pieces constitute an interpretation model that meets the requirements specified in (1) and (2) above and that predicts the set of beliefs and intentions shown to be characteristic of utterances with means clauses and rationale clauses. This model has been implemented in the MAUI system (Multi-Action Utterance Interpreter), which accepts natural language sentences from a user, computes their logical form, and answers questions about the beliefs and intentions of the speaker and actor regarding the actions and action relations described. %R TR-17-93 %D 1993 %T Stochastic Approximation Algorithms for Number Partitioning %A Wheeler Ruml %X This report summarizes research on algorithms for finding particularly good solutions to instances of the NP-complete number-partitioning problem. Our approach is based on stochastic search algorithms, which iteratively improve randomly chosen initial solutions.
Instead of searching the space of all $2^{n-1}$ possible partitionings, however, we use these algorithms to manipulate indirect encodings of candidate solutions. An encoded solution is evaluated by a decoder, which interprets the encoding as instructions for constructing a partitioning of a given problem instance. We present several different solution encodings, including bit strings, permutations, and rule sets, and describe decoding algorithms for them. Our empirical results show that many of these encodings restrict and reshape the solution space in ways that allow relatively generic search methods, such as hill climbing, simulated annealing, and the genetic algorithm, to find solutions that are often as good as those produced by the best known constructive heuristic, and in many cases far superior. For the algorithms and representations we consider, the choice of solution representation plays an even greater role in determining performance than the choice of search algorithm. %X tr-17-93.ps.gz %R TR-18-93 %D 1993 %T POLYSHIFT Communications Software for the Connection Machine System CM-200 %A William George %A Ralph G. Brickner %A S. Lennart Johnsson %X We describe the use and implementation of a polyshift function {\bf PSHIFT} for circular shifts and end-off shifts. Polyshift is useful in many scientific codes using regular grids, such as finite difference codes in several dimensions, multigrid codes, molecular dynamics computations, and in lattice gauge physics computations, such as quantum chromodynamics (QCD) calculations. Our implementation of the {\bf PSHIFT} function on the Connection Machine systems CM-2 and CM-200 offers a speedup of up to a factor of 3--4 compared to {\bf CSHIFT} when the local data motion within a node is small. The {\bf PSHIFT} routine is included in the Connection Machine Scientific Software Library (CMSSL). %X tr-18-93.ps.gz %R TR-19-93 %D 1993 %T High Performance, Scalable Scientific Software Libraries %A S.
Lennart Johnsson %A Kapil K. Mathur %X Massively parallel processors introduce new demands on software systems with respect to performance, scalability, robustness and portability. The increased complexity of the memory systems and the increased range of problem sizes for which a given piece of software is used pose serious challenges to software developers. The Connection Machine Scientific Software Library, CMSSL, uses several novel techniques to meet these challenges. The CMSSL contains routines for managing the data distribution and provides data distribution independent functionality. High performance is achieved through careful scheduling of operations and data motion, and through the automatic selection of algorithms at run-time. We discuss some of the techniques used, and provide evidence that CMSSL has reached the goals of performance and scalability for an important set of applications. %X tr-19-93.ps.gz %R TR-20-93 %D 1993 %T A Collaborative Planning Approach to Discourse Understanding %A Karen Lochbaum %X Approaches to discourse understanding fall roughly into two categories: those that treat mental states as elemental and thus reason directly about them, and those that do not reason about the beliefs and intentions of agents themselves, but about the propositions and actions that might be considered objects of those beliefs and intentions. The first type of approach is a mental phenomenon approach, the second a data-structure approach (Pollack, 1986b). In this paper, we present a mental phenomenon approach to discourse understanding and demonstrate its advantages over the data-structure approaches used by other researchers. The model we present is based on the collaborative planning framework of SharedPlans (Grosz and Sidner, 1990; Lochbaum, Grosz, and Sidner, 1990; Grosz and Kraus, 1993).
SharedPlans are shown to provide a computationally realizable model of the intentional component of Grosz and Sidner's theory of discourse structure (Grosz and Sidner, 1986). Additionally, this model is shown to simplify and extend approaches to discourse understanding that introduce multiple types of plans to model an agent's motivations for producing an utterance (Litman, 1985; Litman and Allen, 1987; Ramshaw, 1991; Lambert and Carberry, 1991). %X tr-20-93.ps.gz %R TR-21-93 %D 1993 %T Evolving Line Drawings %A Ellie Baker %A Margo Seltzer %X This paper explores the application of interactive genetic algorithms to the creation of line drawings. We have built a system that starts with a collection of drawings that are either randomly generated or input by the user. The user selects one such drawing to mutate or two to mate, and a new generation of drawings is produced by randomly modifying or combining the selected drawing(s). This process of selection and procreation is repeated many times to evolve a drawing. A wide variety of complex sketches with highlighting and shading can be evolved from very simple drawings. This technique has enormous potential for augmenting and enhancing the power of traditional computer-aided drawing tools, and for expanding the repertoire of the computer-assisted artist. %X tr-21-93.ps.gz %R TR-22-93 %D 1993 %T A Stencil Compiler for the Connection Machine Models\\ CM-2/200 %A Ralph G. Brickner %A William George %A S. Lennart Johnsson %A Alan Ruttenberg %X In this paper we present a Stencil Compiler for the Connection Machine Models CM-2 and CM-200. A {\em stencil} is a weighted sum of circularly-shifted CM Fortran arrays. The stencil compiler optimizes the data motion between processing nodes, minimizes the data motion within a node, and minimizes the data motion between registers and local memory in a node. The compiler makes novel use of the communication system and has highly optimized register use.
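The definition of a stencil as a weighted sum of circularly shifted arrays can be illustrated with a short Python sketch. This is an illustration of the definition only, not the CMSSL compiler's implementation; the `cshift` and `stencil` helper names are hypothetical.

```python
# Toy illustration (not the CMSSL implementation) of a stencil as a weighted
# sum of circularly shifted 2D arrays, stored as lists of lists with
# periodic boundaries.

def cshift(a, dx, dy):
    """Circularly shift a 2D array by (dx, dy)."""
    n, m = len(a), len(a[0])
    return [[a[(i - dx) % n][(j - dy) % m] for j in range(m)] for i in range(n)]

def stencil(a, terms):
    """Apply a stencil given as [(weight, dx, dy), ...]: a weighted sum of
    circular shifts of the input array."""
    n, m = len(a), len(a[0])
    out = [[0.0] * m for _ in range(n)]
    for w, dx, dy in terms:
        s = cshift(a, dx, dy)
        for i in range(n):
            for j in range(m):
                out[i][j] += w * s[i][j]
    return out

# A 5-point Laplacian-like stencil: center weight -4, the four neighbors 1.
five_point = [(-4.0, 0, 0), (1.0, 1, 0), (1.0, -1, 0), (1.0, 0, 1), (1.0, 0, -1)]
```

Applying `five_point` to a grid computes a discrete Laplacian with periodic boundaries; a stencil compiler's job is to evaluate such sums while minimizing the data motion the shifts imply.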
The compiler natively supports two-dimensional stencils, but stencils in three or four dimensions are automatically decomposed. Portions of the system are integrated as part of the CM Fortran programming system, and also as part of the system microcode. The compiler is available as part of the Connection Machine Scientific Software Library (CMSSL) Release 3.1. %X tr-22-93.ps.gz %R TR-23-93 %D 1993 %T CMSSL: A Scalable Scientific Software Library %A S. Lennart Johnsson %X Massively parallel processors introduce new demands on software systems with respect to performance, scalability, robustness and portability. The increased complexity of the memory systems and the increased range of problem sizes for which a given piece of software is used pose serious challenges for software developers. The Connection Machine Scientific Software Library, CMSSL, uses several novel techniques to meet these challenges. The CMSSL contains routines for managing the data distribution and provides data distribution independent functionality. High performance is achieved through careful scheduling of operations and data motion, and through the automatic selection of algorithms at run-time. We discuss some of the techniques used, and provide evidence that CMSSL has reached the goals of performance and scalability for an important set of applications. %X tr-23-93.ps.gz %R TR-24-93 %D 1993 %T Annotating Floor Plans Using Deformable Polygons %A Kathy Ryall %A Joe Marks %A Murray Mazer %A Stuart Shieber %X The ability to recognize regions in a bitmap image has applications in various areas, from document recognition of scanned building floor plans to processing of scanned forms. We consider the use of deformable polygons for delineating partially or fully bounded regions of a scanned bitmap that depicts a building floor plan. We discuss a semi-automated interactive system, in which a user positions a seed polygon in an area of interest in the image.
The computer then expands and deforms the polygon in an attempt to minimize an energy function that is defined so that configurations with minimum energy tend to match the subjective boundaries of regions in the image. When the deformation process is completed, the user may edit the deformed polygon to make it conform more closely to the desired region. In contrast to area-filling techniques for delineating areal regions of images, our approach works robustly for partially bounded regions. %X tr-24-93.ps.gz %R TR-01-94 %D 1994 %T Reasoning with Models %A Roni Khardon %A Dan Roth %X We develop a model-based approach to reasoning, in which the knowledge base is represented as a set of models (satisfying assignments) rather than a logical formula, and the set of queries is restricted. We show that for every propositional knowledge base (KB) there exists a set of {\em characteristic models} with the property that a query is true in KB if and only if it is satisfied by the models in this set. We fully characterize a set of theories for which the model-based representation is compact and provides efficient reasoning. These include cases where the formula-based representation does not support efficient reasoning. In addition, we consider the model-based approach to {\em abductive reasoning} and show that for any propositional KB, reasoning with its model-based representation yields an abductive explanation in time that is polynomial in its size. Some of our technical results make use of the {\em Monotone Theory}, a new characterization of Boolean functions. \par The notion of {\em restricted queries} is inherent to our approach. This is a wide class of queries for which reasoning is very efficient and exact, even when the model-based representation of KB provides only an approximate representation of the ``world''.
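The entailment test underlying such a model-based representation can be sketched in a few lines of Python. This is a toy illustration only: it stores all models of a small KB, whereas identifying the small sufficient subset (the characteristic models) is the paper's contribution and is not reproduced here.

```python
# Toy sketch of model-based deduction: represent the knowledge base by a set
# of models (satisfying assignments) and answer a query by evaluating it on
# every stored model. KB entails the query iff no stored model falsifies it.

from itertools import product

def models_of(formula, n):
    """All assignments over n Boolean variables satisfying `formula`,
    a predicate on a tuple of booleans."""
    return [v for v in product([False, True], repeat=n) if formula(v)]

def entails(model_set, query):
    """KB |= query iff every model in the set satisfies the query."""
    return all(query(v) for v in model_set)

# Toy KB over variables (a, b): the single clause a -> b.
kb = models_of(lambda v: (not v[0]) or v[1], 2)
```

With this representation, answering a query costs one evaluation per stored model, so reasoning is efficient whenever the model set is small.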
\par Moreover, we show that the theory developed here generalizes the model-based approach to reasoning with Horn theories, and even captures the notion of reasoning with Horn-approximations. Our result characterizes the Horn theories for which the approach suggested in is useful, and the phenomena observed there regarding the relative sizes of the formula-based and model-based representations of KB are explained and put in a wider context. %X tr-01-94.ps.gz %R TR-02-94 %D 1994 %T Learning to Reason %A Roni Khardon %A Dan Roth %X We introduce a new framework for the study of reasoning. The Learning (in order) to Reason approach developed here combines the interfaces to the world used by known learning models with the reasoning task and a performance criterion suitable for it. In this framework the intelligent agent is given access to her favorite learning interface, and is also given a grace period in which she can interact with this interface and construct her representation KB of the world $W$. Her reasoning performance is measured only after this period, when she is presented with queries $\alpha$ from some query language, relevant to the world, and has to answer whether $W$ implies $\alpha$. \par The approach is meant to overcome the main computational difficulties in the traditional treatment of reasoning which stem from its separation from the ``world''. First, by allowing the reasoning task to interface with the world (as in the known learning models), we avoid the rigid syntactic restriction on the intermediate knowledge representation. Second, we make explicit the dependence of the reasoning performance on the input from the environment. This is possible only because the agent interacts with the world when constructing her knowledge representation.
\par We show how previous results from learning theory and reasoning fit into this framework and illustrate the usefulness of the Learning to Reason approach by exhibiting new results that are not possible in the traditional setting. First, we give a Learning to Reason algorithm for a class of propositional languages for which there are no efficient reasoning algorithms, when represented as a traditional (formula-based) knowledge base. Second, we exhibit a Learning to Reason algorithm for a class of propositional languages that is not known to be learnable in the traditional sense. %X tr-02-94.ps.gz %R TR-03-94 %D 1994 %T Mesh Decomposition and Communication Procedures for Finite Element Applications on The Connection Machine CM-5 System %A Z. Johan %A K.K. Mathur %A S.L. Johnsson %A T.J.R. Hughes %X TR-08-94 SUPERSEDES TR-03-94. %R TR-04-94 %D 1994 %T Data Parallel Finite Element Techniques for Compressible Flow Problems %A Z. Johan %A K.K. Mathur %A S.L. Johnsson %A T.J.R. Hughes %X We present a brief description of a finite element solver implemented on the Connection Machine CM-5 system. A more detailed presentation of the issues involved in such an implementation can be found in [1,2]. %X tr-04-94.ps.gz %R TR-05-94 %D 1994 %T Motion-Synthesis Techniques for 2D Articulated Figures %A Alex Fukunaga %A Lloyd Hsu %A Peter Reiss %A Andrew Shuman %A Jon Christensen %A Joe Marks %A J. Thomas Ngo %X In this paper we extend previous work on automatic motion synthesis for physically realistic 2D articulated figures in three ways. First, we describe an improved motion-synthesis algorithm that runs substantially faster than previously reported algorithms. Second, we present two new techniques for influencing the style of the motions generated by the algorithm. These techniques can be used by an animator to achieve a desired movement style, or they can be used to guarantee variety in the motions synthesized over several runs of the algorithm.
Finally, we describe an animation editor that supports the interactive concatenation of existing, automatically generated motion controllers to produce complex, composite trajectories. Taken together, these results suggest how a usable, useful system for articulated-figure motion synthesis might be developed. %X tr-05-94.ps.gz %R TR-06-94 %D 1994 %T Motion Synthesis for 3D Articulated Figures and Mass-Spring Models %A Hadi Partovi %A Jon Christensen %A Amir Khosrowshahi %A Joe Marks %A J. Thomas Ngo %X Motion synthesis is the process of automatically generating visually plausible motions that meet goal criteria specified by a human animator. The objects whose motions are synthesized are often animated characters that are modeled as articulated figures or mass-spring lattices. Controller synthesis is a technique for motion synthesis that involves searching in a space of possible controllers to generate appropriate motions. Recently, automatic controller-synthesis techniques for 2D articulated figures have been reported. An open question is whether these techniques can be generalized to work for 3D animated characters. In this paper we report successful automatic controller synthesis for 3D articulated figures and mass-spring models that are subject to nonholonomic constraints. These results show that the 3D motion-synthesis problem can be solved in some challenging cases, though much work on this general topic remains to be done. %X tr-06-94.ps.gz %R TR-07-94 %D 1994 %T Parallel implementation of recursive spectral bisection on the Connection Machine CM-5 system %A Zden\u{e}k Johan %A Kapil K. Mathur %A S. Lennart Johnsson %A Thomas J.R. Hughes %X The recursive spectral bisection (RSB) algorithm was proposed by Pothen {\em et al.} [1] as the basis for computing small vertex separators for sparse matrices. Simon [2] applied this algorithm to mesh decomposition and showed that spectral bisection compared favorably with other decomposition techniques. 
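For background, one level of spectral bisection can be sketched in Python: form the graph Laplacian, approximate its Fiedler vector (the eigenvector of the second-smallest eigenvalue) by power iteration, and split vertices by sign. This is a serial toy version with hypothetical helper names, not the paper's data-parallel implementation.

```python
# Toy sketch of one level of spectral bisection. We power-iterate with the
# shifted matrix c*I - L (so the smallest eigenvalues of L become the
# largest), projecting out the constant eigenvector at every step, then
# split vertices by the sign of the resulting approximate Fiedler vector.

def spectral_bisect(n, edges, iters=500):
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    c = 2 * max(deg) + 1              # c exceeds the largest eigenvalue of L

    def apply_shifted(x):             # y = (c*I - L) x, with L = D - A
        y = [(c - deg[i]) * x[i] for i in range(n)]
        for u, v in edges:
            y[u] += x[v]
            y[v] += x[u]
        return y

    # Fixed pseudo-random start vector (deterministic for reproducibility).
    x = [((i * 2654435761) % 1000) / 1000.0 - 0.5 for i in range(n)]
    for _ in range(iters):
        x = apply_shifted(x)
        mean = sum(x) / n             # project out the all-ones eigenvector
        x = [xi - mean for xi in x]
        norm = sum(xi * xi for xi in x) ** 0.5
        x = [xi / norm for xi in x]
    return [i for i in range(n) if x[i] < 0], [i for i in range(n) if x[i] >= 0]
```

On a graph made of two triangles joined by a single edge, the sign split recovers the two triangles, which is the intuitively minimal cut.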
Since then, the RSB algorithm has been widely accepted in the scientific community because of its robustness and the consistently high quality of the partitionings it generates. The major drawback of the RSB algorithm is its high computing cost, as noted in [2], caused by the need for solving a series of eigenvalue problems. It is often stated that an unstructured mesh can be decomposed after it is generated, and the decomposition reused for the different calculations performed on that mesh. However, a new partitioning must be obtained if adaptive mesh refinement is required. The mesh also has to be re-decomposed if the number of processing nodes available to the user changes between two calculations. To keep the mesh decomposition from becoming a significant computational bottleneck, an efficient data-parallel implementation of the RSB algorithm using the CM Fortran language [3] is developed. In this paper, we present only an abbreviated description of the parallel implementation of the RSB algorithm, followed by two decomposition examples. Details of the implementation can be found in [4]. %X tr-07-94.ps.gz %R TR-08-94 %D 1994 %T Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System %A Zden\u{e}k Johan %A Kapil K. Mathur %A S. Lennart Johnsson %A Thomas J.R. Hughes %X The objective of this paper is to analyze the impact of data mapping strategies on the performance of finite element applications. First, we describe a parallel mesh decomposition algorithm based on recursive spectral bisection used to partition the mesh into element blocks. A simple heuristic algorithm then renumbers the mesh nodes. Large three-dimensional meshes demonstrate the efficiency of those mapping strategies and assess the performance of a finite element program for fluid dynamics. %X tr-08-94.ps.gz %R TR-09-94 %D 1994 %T Data Motion and High Performance Computing %A S.
Lennart Johnsson %X Efficient data motion has been of critical importance in high performance computing almost since the first electronic computers were built. Providing sufficient memory bandwidth to balance the capacity of processors led to memory hierarchies and to banked and interleaved memories. With the rapid evolution of MOS technologies, microprocessor and memory designs, it is realistic to build systems with thousands of processors and a sustained performance of a trillion operations per second or more. Such systems require tens of thousands of memory banks, even when locality of reference is exploited. Using conventional technologies, interconnecting several thousand processors with tens of thousands of memory banks is feasible only through some form of sparse interconnection network. Efficient use of locality of reference and network bandwidth is critical. We review these issues in this paper. %X tr-09-94.ps.gz %R TR-10-94 %D 1994 %T Easily Searched Encodings for Number Partitioning %A Wheeler Ruml %A J. Thomas Ngo %A Joe Marks %A Stuart M. Shieber %X Can stochastic search algorithms outperform existing deterministic heuristics for the NP-hard problem Number Partitioning if given a sufficient, but practically realizable amount of time? In a thorough empirical investigation using a straightforward implementation of one such algorithm, simulated annealing, Johnson et al. (1991) concluded tentatively that the answer is "no." In this paper we show that the answer can be "yes" if attention is devoted to the issue of problem representation (encoding). We present results from empirical tests of several encodings of Number Partitioning with problem instances consisting of multiple-precision integers drawn from a uniform probability distribution.
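For context, the constructive heuristic such searches are measured against, Karmarkar and Karp's differencing method, can be sketched in a few lines. This is the standard textbook formulation, not code from the report.

```python
# Sketch of the Karmarkar-Karp differencing heuristic for two-way Number
# Partitioning: repeatedly replace the two largest remaining numbers by
# their difference (implicitly committing them to opposite sides); the last
# value left is the achieved difference between the two subset sums.

import heapq

def karmarkar_karp(nums):
    """Return the final partition difference produced by differencing."""
    heap = [-x for x in nums]          # max-heap via negation
    heapq.heapify(heap)
    while len(heap) > 1:
        a = -heapq.heappop(heap)       # two largest remaining values
        b = -heapq.heappop(heap)
        heapq.heappush(heap, -(a - b)) # commit them to opposite sides
    return -heap[0]
```

For example, on [4, 5, 6, 7, 8] differencing achieves a difference of 2 although a perfect split ({8, 7} versus {4, 5, 6}) exists, which is the kind of gap the searched encodings aim to close.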
With these instances and with an appropriate choice of representation, stochastic and deterministic searches can -- routinely and in a practical amount of time -- find solutions several orders of magnitude better than those constructed by the best heuristic known (Karmarkar and Karp, 1982), which does not employ searching. The choice of encoding is found to be more important than the choice of search technique in determining search efficacy. Three alternative explanations for the relative performance of the encodings are tested experimentally. The best encodings tested are found to contain a high proportion of good solutions; moreover, in those encodings, the solutions are organized into a single "bumpy funnel" centered at a known position in the search space. This is likely to be the only relevant structure in the search space because a blind search performs as well as any other search technique tested when the search space is restricted to the funnel tip. We also show how analogous representations might be designed in a principled manner for other difficult combinatorial optimization problems by applying the principles of parameterized arbitration, parameterized constraint, and parameterized greediness. \par Keywords: number partitioning, NP-complete, representation, encoding, empirical comparison, stochastic optimization, parameterized arbitration, parameterized constraint, parameterized greediness. %X tr-10-94r.ps.gz %R TR-11-94 %D 1994 %T Principles and Implementation of Deductive Parsing %A Stuart M. Shieber %A Yves Schabes %A Fernando C. N. Pereira %X We present a system for generating parsers based directly on the metaphor of parsing as deduction. Parsing algorithms can be represented directly as deduction systems, and a single deduction engine can interpret such deduction systems so as to implement the corresponding parser. 
The method generalizes easily to parsers for augmented phrase structure formalisms, such as definite-clause grammars and other logic grammar formalisms, and has been used for rapid prototyping of parsing algorithms for a variety of formalisms including variants of tree-adjoining grammars, categorial grammars, and lexicalized context-free grammars. %X tr-11-94.ps.gz %R TR-12-94 %D 1994 %T Multiple Containment Methods %A Karen Daniels %A Zhenyu Li %A Victor Milenkovic %X We present three different methods for finding solutions to the 2D translation-only {\em containment} problem: find translations for $k$ polygons that place them inside a given polygonal container without overlap. Both the container and the polygons to be placed in it may be non-convex. First, we provide several exact algorithms that improve results for $k=2$ or $k=3$. In particular, we give an algorithm for three convex polygons and a non-convex container with running time in ${\rm O}(m^3n\log mn)$, where $n$ is the number of vertices in the container, and $m$ is the total number of vertices of the $k$ polygons. This is an improvement of a factor of $n^2$ over previous algorithms. Second, we give an approximation algorithm for $k$ non-convex polygons and a non-convex container based on restriction and subdivision of the configuration space. Third, we develop a MIP (mixed integer programming) model for $k$ non-convex polygons and a non-convex container. %X tr-12-94.ps.gz %R TR-13-94 %D 1994 %T A Recursive Coalescing Method for Bisecting Graphs %A Bryan Mazlish %A Stuart Shieber %A Joe Marks %X We present an extension to a hybrid graph-bisection algorithm developed by Bui et al. that uses vertex coalescing and the Kernighan-Lin variable-depth algorithm to minimize the size of the cut set. In the original heuristic technique, one iteration of vertex coalescing is used to improve the performance of the original Kernighan-Lin algorithm.
We show that by performing vertex coalescing recursively, substantially greater improvements can be achieved for standard random graphs of average degree in the range [2.0,5.0]. %X tr-13-94.ps.gz %R TR-14-94 %D 1994 %T PRISC: Programmable Reduced Instruction Set Computers %A Rahul Razdan %X This thesis introduces Programmable Reduced Instruction Set Computers (PRISC) as a new class of general-purpose computers. PRISC use RISC techniques as a base, but in addition to the conventional RISC instruction resources, PRISC offer hardware programmable resources which can be configured based on the needs of a particular application. This thesis presents the architecture, operating system, and programming language compilation techniques which are needed to successfully build PRISC. Performance results are provided for the simplest form of PRISC -- a RISC microprocessor with a set of programmable functional units consisting of only combinational functions. Results for the SPECint92 benchmark suite indicate that an augmented compiler can provide a performance improvement of 22\% over the underlying RISC computer with a hardware area investment less than that needed for a 2 kilobyte SRAM. In addition, active manipulation of the source code leads to significantly higher local performance gains (250\%-500\%) for general abstract data types such as short-set vectors, hash tables, and finite state machines. Results on end-user applications that utilize these data types indicate a performance gain from 32\%-213\%. %X tr-14-94.tar.Z %R TR-15-94 %D 1994 %T Compaction Algorithms for Non-Convex Polygons and Their Applications %A Zhenyu Li %X Given a two-dimensional, non-overlapping layout of convex and non-convex polygons, {\em compaction} refers to a simultaneous motion of the polygons that generates a more densely packed layout. In industrial two-dimensional packing applications, compaction can improve the material utilization of already tightly packed layouts. 
Efficient algorithms for compacting a layout of non-convex polygons were not previously known. \par This dissertation offers the first systematic study of compaction of non-convex polygons. We start by formalizing the compaction problem as that of planning a motion that minimizes some linear objective function of the positions. Based on this formalization, we study the complexity of compaction and show it to be PSPACE-hard. \par The major contribution of this dissertation is a position-based optimization model that allows us to calculate directly new polygon positions that constitute a locally optimum solution of the objective via linear programming. This model yields the first practically efficient algorithm for translational compaction--compaction in which the polygons can only translate. This compaction algorithm runs in almost real time and improves the material utilization of production quality human-generated layouts from the apparel industry. \par Several algorithms are derived directly from the position-based optimization model to solve related problems arising from manual or automatic layout generation. In particular, the model yields an algorithm for separating overlapping polygons using a minimal amount of motion. This separation algorithm together with a database of human-generated markers can automatically generate markers that approach human performance. \par Additionally, we provide several extensions to the position-based optimization model. These extensions enable the model to handle small rotations, to offer flexible control of the distances between polygons, and to find optimal solutions to the two-dimensional packing of non-convex polygons. \par This dissertation also includes a compaction algorithm based on existing physical simulation approaches.
Although our experimental results showed that it is not practical for compacting tightly packed layouts, this algorithm is of interest because it shows that the simulation can be sped up significantly if geometrical constraints replace physical constraints. It also reveals the inherent limitations of physical simulation algorithms in compacting tightly packed layouts. \par Most of the algorithms presented in this dissertation have been implemented on a SUN ${\rm SparcStation}^{\rm TM}$ and have been included in a software package licensed to a CAD company. %X tr-15-94.ps.gz %R TR-16-94 %D 1994 %T Scalability of Finite Element Applications on Distributed-Memory Parallel Computers %A Zden\u{e}k Johan %A Kapil K. Mathur %A S. Lennart Johnsson %A Thomas J.R. Hughes %X This paper demonstrates that scalability and competitive efficiency can be achieved for unstructured grid finite element applications on distributed memory machines, such as the Connection Machine CM-5 system. The efficiency of finite element solvers is analyzed through two applications: an implicit computational aerodynamics application and an explicit solid mechanics application. Scalability of mesh decomposition and data mapping strategies is also discussed. Numerical examples that support the claims for problems with more than fourteen million variables are presented. %X tr-16-94.ps.gz %R TR-17-94 %D 1994 %T Improved Noise-Tolerant Learning and Generalized Statistical Queries %A Javed A. Aslam %A Scott E. Decatur %X The statistical query learning model can be viewed as a tool for creating (or demonstrating the existence of) noise-tolerant learning algorithms in the PAC model. The complexity of a statistical query algorithm, in conjunction with the complexity of simulating SQ algorithms in the PAC model with noise, determines the complexity of the noise-tolerant PAC algorithms produced.
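A simplified special case conveys the flavor of such simulations: a statistic that depends only on the label can be estimated under classification noise of known rate eta < 1/2 by inverting the flip probability. This sketch is illustrative only and is not the paper's general simulation; all names in it are hypothetical.

```python
# Simplified illustration of answering a statistical query under
# classification noise: if each label is flipped independently with known
# rate eta < 1/2, then
#     P_noisy = P_true * (1 - eta) + (1 - P_true) * eta,
# so an estimate of P_true is recovered as (P_noisy - eta) / (1 - 2*eta).

import random

def corrected_label_statistic(noisy_labels, eta):
    """Estimate P[label = 1] from labels observed through eta-flip noise."""
    p_noisy = sum(noisy_labels) / len(noisy_labels)
    return (p_noisy - eta) / (1 - 2 * eta)

# Usage sketch: true labels are 1 with probability 0.7, observed through
# 20% classification noise; the corrected estimate recovers roughly 0.7.
rng = random.Random(0)
eta = 0.2
true_labels = [1 if rng.random() < 0.7 else 0 for _ in range(200000)]
noisy = [1 - l if rng.random() < eta else l for l in true_labels]
estimate = corrected_label_statistic(noisy, eta)
```

Note the factor 1/(1 - 2*eta): as the noise rate approaches 1/2, the variance of the corrected estimate blows up, which is why the sample complexity of noise-tolerant simulations depends on the noise rate.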
Although roughly optimal upper bounds have been shown for the complexity of statistical query learning, the corresponding noise-tolerant PAC algorithms are not optimal due to inefficient simulations. In this paper we provide both improved simulations and a new variant of the statistical query model in order to overcome these inefficiencies. \par We improve the time complexity of the classification noise simulation of statistical query algorithms. Our new simulation has a roughly optimal dependence on the noise rate. We also derive a simpler proof that statistical queries can be simulated in the presence of classification noise. This proof makes fewer assumptions on the queries themselves and therefore allows one to simulate more general types of queries. \par We also define a new variant of the statistical query model based on relative error, and we show that this variant is more natural and strictly more powerful than the standard additive error model. We demonstrate efficient PAC simulations for algorithms in this new model and give general upper bounds on both learning with relative error statistical queries and PAC simulation. We show that any statistical query algorithm can be simulated in the PAC model with malicious errors in such a way that the resultant PAC algorithm has a roughly optimal tolerable malicious error rate and sample complexity. \par Finally, we generalize the types of queries allowed in the statistical query model. We discuss the advantages of allowing these generalized queries and show that our results on improved simulations also hold for these queries. %X tr-17-94.ps.gz %R TR-18-94 %D 1994 %T Finite Element Techniques for Computational Fluid Dynamics on the Connection Machine CM-5 System %A Z. Johan %A K.K. Mathur %A S.L. Johnsson %A T.J.R. Hughes %X tr-18-94.ps.gz %R TR-19-94 %D 1994 %T Scientific Software Libraries for Scalable Architectures %A S. Lennart Johnsson %A Kapil K. 
Mathur %X Massively parallel processors introduce new demands on software systems with respect to performance, scalability, robustness and portability. The increased complexity of the memory systems and the increased range of problem sizes for which a given piece of software is used pose serious challenges to software developers. The Connection Machine Scientific Software Library, CMSSL, uses several novel techniques to meet these challenges. The CMSSL contains routines for managing the data distribution and provides data distribution independent functionality. High performance is achieved through careful scheduling of arithmetic operations and data motion, and through the automatic selection of algorithms at run--time. We discuss some of the techniques used, and provide evidence that CMSSL has reached the goals of performance and scalability for an important set of applications. %X tr-19-94.ps.gz %R TR-20-94 %D 1994 %T Load-Balanced LU and QR Factor and Solve Routines for Scalable Processors with Scalable I/O %A Jean-Philippe Brunet %A Palle Pedersen %A S. Lennart Johnsson %X The concept of block--cyclic order elimination can be applied to out--of--core $LU$ and $QR$ matrix factorizations on distributed memory architectures equipped with a parallel I/O system. This elimination scheme provides load-balanced computation in both the factor and solve phases and further optimizes the use of the network bandwidth to perform I/O operations. Stability of LU factorization is enforced by full column pivoting. Performance results are presented for the Connection Machine system CM--5. %X tr-20-94.ps.gz %R TR-21-94 %D 1994 %T ROMM Routing: A Class of Efficient Minimal Routing Algorithms %A Ted Nesson %A Lennart Johnsson %X ROMM is a class of Randomized, Oblivious, Multi--phase, Minimal routing algorithms. ROMM routing offers a potential for improved performance compared to fully randomized algorithms under both light and heavy loads.
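The flavor of the scheme can be conveyed by a small Python sketch of two-phase ROMM routing on a binary cube: route minimally to a random intermediate node inside the minimal subcube spanned by source and destination, then minimally on to the destination. This is an assumed simplification for illustration; the report's algorithms and buffer management are more general.

```python
# Sketch of two-phase ROMM routing on an n-dimensional binary cube.
# Every hop corrects exactly one differing address bit, and the random
# intermediate node lies in the subcube spanned by src and dst, so the
# combined path is still minimal.

import random

def minimal_path(src, dst, rng):
    """Minimal hypercube route: flip the differing address bits of src,
    one per hop, in random order."""
    x = src ^ dst
    dims = [i for i in range(x.bit_length()) if x >> i & 1]
    rng.shuffle(dims)
    path, node = [src], src
    for i in dims:
        node ^= 1 << i
        path.append(node)
    return path

def romm_route(src, dst, rng):
    """Two-phase ROMM: minimal route to a random node in the minimal
    subcube spanned by src and dst, then a minimal route on to dst."""
    mask = src ^ dst
    inter = src ^ (rng.getrandbits(mask.bit_length()) & mask) if mask else src
    return minimal_path(src, inter, rng) + minimal_path(inter, dst, rng)[1:]
```

Because the intermediate node only randomizes the differing bits, the path length always equals the Hamming distance between source and destination, unlike fully randomized (Valiant-style) routing, which doubles the expected path length.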
ROMM routing also offers close to best case performance for many common permutations. These claims are supported by extensive simulations of binary cube networks for a number of routing patterns. We show that $k\times n$ buffers per node suffice to make $k$--phase ROMM routing free from deadlock and livelock on $n$--dimensional binary cubes. %X tr-21-94.ps.gz %R TR-22-94 %D 1994 %T Issues in High Performance Computer Networks %A S. Lennart Johnsson %X tr-22-94.ps.gz %R TR-23-94 %D 1994 %T A Comparative Study of Search and Optimization Algorithms for the Automatic Control of Physically Realistic 2-D Animated Figures %A Alex Fukunaga %A Jon Christensen %A J. Thomas Ngo %A Joe Marks %X In the Spacetime Constraints paradigm of animation, the animator specifies what a character should do, and the details of the motion are generated automatically by the computer. Ngo and Marks recently proposed a technique of automatic motion synthesis that uses a massively parallel genetic algorithm to search a space of motion controllers that generate physically realistic motions for 2D articulated figures. In this paper, we describe an empirical study of evolutionary computation algorithms and standard function optimization algorithms that were implemented in lieu of the massively parallel GA in order to find a substantially more efficient search algorithm that would be viable on serial workstations. We discovered that simple search algorithms based on the evolutionary programming paradigm were most efficient in searching the space of motion controllers. %X tr-23-94.ps.gz %R TR-24-94 %D 1994 %T Implementing O(N) N-body Algorithms Efficiently in Data Parallel Languages (High Performance Fortran) %A Yu Hu %A S. Lennart Johnsson %X O(N) algorithms for N-body simulations enable the simulation of particle systems with up to 100 million particles on current Massively Parallel Processors (MPPs). 
Our optimization techniques mainly focus on minimizing the data movement through careful management of the data distribution and the data references, both between the memories of different nodes, and within the memory hierarchy of each node. We show how the techniques can be expressed in languages with an array syntax, such as Connection Machine Fortran (CMF). All CMF constructs used, with one exception, are included in High Performance Fortran. \par The effectiveness of our techniques is demonstrated on an implementation of Anderson's hierarchical O(N) N-body method for the Connection Machine system CM-5/5E. Communication accounts for about 10-20\% of the total execution time, with the average efficiency for arithmetic operations being about 40\% and the total efficiency (including communication) being about 35\%. For the CM-5E a performance in excess of 60 Mflop/s per node (peak 160 Mflop/s per node) has been measured. %X tr-24-94.ps.gz %R TR-25-94 %D 1994 %T Using Collaborative Plans to Model the Intentional Structure of Discourse %A Karen E. Lochbaum %X An agent's ability to understand an utterance depends upon its ability to relate that utterance to the preceding discourse. The agent must determine whether the utterance begins a new segment of the discourse, completes the current segment, or contributes to it. The intentional structure of the discourse, comprised of discourse segment purposes and their interrelationships, plays a central role in this process (Grosz and Sidner, 1986). In this thesis, we provide a computational model for recognizing intentional structure and utilizing it in discourse processing. The model specifies how an agent's beliefs about the intentions underlying a discourse affect and are affected by its subsequent discourse. We characterize this process for both interpretation and generation and then provide specific algorithms for modeling the interpretation process.
\par The collaborative planning framework of SharedPlans (Lochbaum, Grosz, and Sidner, 1990; Grosz and Kraus, 1993) provides the basis for our model of intentional structure. Under this model, agents are taken to engage in discourses and segments of discourses for reasons that derive from the mental state requirements of action and collaboration. Each utterance of a discourse is understood in terms of its contribution to the SharedPlans in which the discourse participants are engaged. We demonstrate that this model satisfies the requirements of Grosz and Sidner's (1986) theory of discourse structure and also simplifies and extends previous plan-based approaches to dialogue understanding. The model has been implemented in a system that demonstrates the contextual role of intentional structure in both interpretation and generation. %X tr-25-94.ps.gz %R TR-26-94 %D 1994 %T A Data Parallel Implementation of Hierarchical N-body Methods %A Yu Hu %A S. Lennart Johnsson %X The O(N) hierarchical N-body algorithms and Massively Parallel Processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We describe a data parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10-25\%, and the overall efficiency is about 35\%. On a CM-5E the overall performance is about 60 Mflop/s per node, independent of the number of nodes. %X tr-26-94.ps.gz %R TR-27-94 %D 1994 %T Local Basic Linear Algebra Subroutines (LBLAS) for the CM--5/5E %A David Kramer %A S. Lennart Johnsson %A Yu Hu %X The Connection Machine Scientific Software Library (CMSSL) is a library of scientific routines designed for distributed memory architectures. The BLAS of the CMSSL have been implemented as a two--level structure to exploit optimizations local to nodes and across nodes. 
This paper presents the implementation considerations and performance of the Local BLAS, or BLAS local to each node of the system. A wide variety of loop structures and unrollings have been implemented in order to achieve a uniform and high performance, irrespective of the data layout in node memory. The CMSSL is the only existing high--performance library capable of supporting both the data parallel and message passing modes of programming a distributed memory computer. The implications of implementing BLAS on distributed memory computers are considered in this light. %X tr-27-94.ps.gz %R TR-28-94 %D 1994 %T Infrastructure for Research towards Ubiquitous Information Systems %A Barbara Grosz %A H.T. Kung %A Margo Seltzer %A Stuart Shieber %A Michael Smith %X tr-28-94.ps.gz %R TR-29-94 %D 1994 %T Automatic Derivation of Parallel and Systolic Programs %A Lilei Chen %X We present a simple method for developing parallel and systolic programs from data dependence. We derive sequences of parallel computations and communications based on data dependence and communication delays, and minimize the communication delays and processor idle time. The potential applications for this method include supercompiling, automatic development of parallel programs, and systolic array design. %X tr-29-94.ps.gz %R TR-30-94 %D 1994 %T VINO: An Integrated Platform for Operating System and Database Research %A Christopher Small %A Margo Seltzer %X In 1981, Stonebraker wrote: \par Operating system services in many existing systems are either too slow or inappropriate. Current DBMSs usually provide their own and make little or no use of those offered by the operating system. \par The standard operating system model has changed little since that time, and we believe that, at its core, it is the {\em wrong} model for DBMS and other resource-intensive applications. The standard model is inflexible, uncooperative, and irregular in its treatment of resources. 
\par We describe the design of a new system, the VINO kernel, which addresses the limitations of standard operating systems. It focuses on three key ideas: \par - Applications direct policy. - Kernel mechanisms are reusable by applications. - All resources share a common extensible interface. \par VINO's power and flexibility make it an ideal platform for the design and implementation of traditional and modern database management systems. %X tr-30-94.ps.gz %R TR-31-94 %D 1994 %T Abstract Execution in a Multi-Tasking Environment %A David Mazi\'{e}res %A Michael D. Smith %X Tracing software execution is an important part of understanding system performance. Raw CPU power has been increasing at a rate far greater than memory and I/O bandwidth, with the result that the performance of client/server and I/O-bound applications is not scaling as one might hope. Unfortunately, the behavior of these types of applications is particularly sensitive to the kinds of distortion induced by traditional tracing methods, so that current traces are either incomplete or of questionable accuracy. Abstract execution is a powerful tracing technique which was invented to speed the tracing of single processes and to store trace data more compactly. In this work, abstract execution was extended to trace multi-tasking workloads. The resulting system is more than 5 times faster than other current methods of gathering multi-tasking traces, and can therefore generate traces with far less time distortion. %X tr-31-94.ps.gz %R TR-32-94 %D 1994 %T Rationality %A L.G. Valiant %X tr-32-94.ps.gz %R TR-33-94 %D 1994 %T Derivatives of the Matrix Exponential and their Computation %A Igor Najfeld %A Timothy F. Havel %X Matrix exponentials and their derivatives play an important role in the perturbation analysis, control and parameter estimation of linear dynamical systems. 
The well-known integral representation of the derivative of the matrix exponential $\exp(t{\bf A})$ in the matrix direction ${\bf V}$, $\int_0^t \exp((t-\tau){\bf A})\,{\bf V}\exp(\tau{\bf A})\,{\rm d}\tau$, enables us to derive a number of new properties of this derivative, along with spectral, series and exact representations. Many of these results extend to arbitrary analytic functions of a matrix argument, for which we have also derived a simple relation between the gradients of their entries and the directional derivatives in the elementary directions. Based on these results, we construct and optimize two new algorithms for computing the directional derivative. We have also developed a new algorithm for computing the matrix exponential, based on a rational representation of the exponential in terms of the hyperbolic function ${\bf A}\coth({\bf A})$, which is more efficient than direct Pad\'e approximation. Finally, these results are illustrated by an application to a biologically important parameter estimation problem which arises in nuclear magnetic resonance spectroscopy. %X tr-33-94.ps.gz %R TR-34-94 %D 1994 %T VINO: The 1994 Fall Harvest %A Yasuhiro Endo %A James Gwertzman %A Margo Seltzer %A Christopher Small %A Keith A. Smith %A Diane Tang %X tr-34-94.ps.gz %R TR-35-94 %D 1994 %T File Layout and File System Performance %A Keith Smith %A Margo Seltzer %X Most contemporary implementations of the Berkeley Fast File System optimize file system throughput by allocating logically sequential data to physically contiguous disk blocks. This clustering is effective when there are many contiguous free blocks on the file system. But the repeated creation and deletion of files of varying sizes that occurs over time on active file systems is likely to cause fragmentation of free space, limiting the ability of the file system to allocate data contiguously and therefore degrading performance.
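The integral representation quoted for TR-33-94 above, $\int_0^t \exp((t-\tau){\bf A})\,{\bf V}\exp(\tau{\bf A})\,{\rm d}\tau$, can be evaluated without quadrature through a standard block-matrix identity (a well-known technique, not necessarily the report's own algorithm): the directional derivative appears as the top-right block of the exponential of a block-triangular matrix. A minimal NumPy/SciPy sketch, checked against a central finite difference:

```python
import numpy as np
from scipy.linalg import expm

def dexpm(A, V, t=1.0):
    """Directional derivative of expm(t*A) in direction V.

    Uses the block identity: the top-right n-by-n block of
    expm(t * [[A, V], [0, A]]) equals
    int_0^t expm((t-s)A) V expm(sA) ds.
    """
    n = A.shape[0]
    M = np.block([[A, V], [np.zeros((n, n)), A]])
    return expm(t * M)[:n, n:]

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
V = rng.standard_normal((4, 4))

# Sanity check against a central finite difference of expm.
h = 1e-5
fd = (expm(A + h * V) - expm(A - h * V)) / (2 * h)
assert np.allclose(dexpm(A, V), fd, atol=1e-6)
```

SciPy also ships `scipy.linalg.expm_frechet` for the same quantity; the block form above is shown only because it maps directly onto the integral in the abstract.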
\par This paper presents empirical data and the analysis of allocation and fragmentation in the SunOS 4.1.3 file system (a derivative of the Berkeley Fast File System). We have collected data from forty-eight file systems on four file servers over a period of ten months. Our data show that small files are more fragmented than large files, with fewer than 35\% of the blocks in two-block files being allocated optimally, but more than 80\% of the blocks in files larger than 256 kilobytes being allocated optimally. Two factors are responsible for this difference in fragmentation: an uneven distribution of free space within file system cylinder groups, and a disk allocation algorithm which frequently allocates the last block of a file discontiguously from the rest of the file. \par Performance measurements on replicas of active file systems show that they seldom perform as well as comparable empty file systems but that this performance degradation is rarely more than 10-15\%. This decline in performance is directly correlated with the amount of fragmentation in the files used by the benchmark programs. Both file system utilization and the amount of fragmentation in existing files on the file system influence the amount of fragmentation in newly created files. Characteristics of the file system workload also have a significant impact on file system fragmentation and performance, with typical news server workloads causing extreme fragmentation. %X tr-35-94.ps.gz %R TR-36-94 %D 1994 %T Bulk Synchronous Parallel Computing -- A Paradigm for Transportable Software %A Thomas Cheatham %A Amr Fahmy %A Dan C. Stefanescu %A Leslie G. Valiant %X A necessary condition for the establishment, on a substantial basis, of a parallel software industry would appear to be the availability of technology for generating transportable software, i.e. architecture independent software which delivers scalable performance for a wide variety of applications on a wide range of multiprocessor computers.
This paper describes H-BSP -- a general purpose parallel computing environment for developing transportable algorithms. H-BSP is based on the Bulk Synchronous Parallel Model (BSP), in which a computation involves a number of supersteps, each having several parallel computational threads that synchronize at the end of the superstep. The BSP Model deals explicitly with the notion of communication among computational threads and introduces parameters g and L that quantify the ratio of communication throughput to computation throughput, and the synchronization period, respectively. These two parameters, together with the number of processors and the problem size, are used to quantify the performance and, therefore, the transportability of given classes of algorithms across machines having different values for these parameters. This paper describes the role of unbundled compiler technology in facilitating the development of such a parallel computing environment. %X tr-36-94.ps.gz %R TR-01-95 %D 1995 %T Bayesian Grammar Induction for Language Modeling %A Stanley F. Chen %X We describe a corpus-based induction algorithm for probabilistic context-free grammars. The algorithm employs a greedy heuristic search within a Bayesian framework, and a post-pass using the Inside-Outside algorithm. We compare the performance of our algorithm to n-gram models and the Inside-Outside algorithm in three language modeling tasks. In two of these domains, our algorithm outperforms these other techniques, marking the first time a grammar-based language model has surpassed n-gram modeling in a task of at least moderate size. %X tr-01-95.ps.gz %R TR-02-95 %D 1995 %T Learning in Order to Reason %A Dan Roth %X Any theory aimed at understanding {\em commonsense} reasoning, the process that humans use to cope with the mundane but complex aspects of the world in evaluating everyday situations, should account for its flexibility, its adaptability, and the speed with which it is performed.
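The BSP cost accounting described for TR-36-94 above charges each superstep for its local work plus communication and synchronization terms governed by the parameters g and L. The standard BSP formula charges a superstep w + g·h + L, where w is the maximum local work and h the size of the largest per-processor communication relation; the processor counts and numbers below are hypothetical, for illustration only:

```python
def superstep_cost(work, fan_in, fan_out, g, L):
    """Standard BSP cost of one superstep: the slowest thread's local
    work, plus g times the size h of the largest h-relation routed,
    plus the synchronization cost L."""
    w = max(work)                         # slowest thread's computation
    h = max(max(fan_in), max(fan_out))    # largest per-processor traffic
    return w + g * h + L

def program_cost(supersteps, g, L):
    # A BSP program's cost is the sum of its supersteps' costs.
    return sum(superstep_cost(w, fi, fo, g, L) for w, fi, fo in supersteps)

# Hypothetical 2-superstep program on 4 processors:
# (local work, messages received, messages sent) per processor.
steps = [
    ([100, 120, 90, 110], [4, 2, 2, 0], [2, 2, 2, 2]),
    ([80, 80, 80, 80],    [1, 1, 1, 1], [1, 1, 1, 1]),
]
print(program_cost(steps, g=4, L=50))  # → (120+16+50) + (80+4+50) = 320
```

Transportability predictions then follow by re-evaluating the same expression with another machine's g and L.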
\par In this thesis we analyze current theories of reasoning and argue that they do not satisfy those requirements. We then proceed to develop a new framework for the study of reasoning, in which a learning component has a principal role. We show that our framework efficiently supports a lot ``more reasoning'' than traditional approaches and at the same time matches our expectations of plausible patterns of reasoning in cases where other theories do not. \par In the first part of this thesis we present a computational study of the knowledge-based system approach, the generally accepted framework for reasoning in intelligent systems. We present a comprehensive study of several methods used in approximate reasoning as well as some reasoning techniques that use approximations in an effort to avoid computational difficulties. We show that these are even harder computationally than exact reasoning tasks. What is more surprising is that, as we show, even the approximate versions of these approximate reasoning tasks are intractable, and these severe hardness results on approximate reasoning hold even for very restricted knowledge representations. \par Motivated by these computational considerations we argue that a central question to consider, if we want to develop computational models for commonsense reasoning, is how the intelligent system acquires its knowledge and how this process of interaction with its environment influences the performance of the reasoning system. The {\em Learning to Reason} framework developed and studied in the rest of the thesis exhibits the role of inductive learning in achieving efficient reasoning, and the importance of studying reasoning and learning phenomena together. The framework is defined in a way that is intended to overcome the main computational difficulties in the traditional treatment of reasoning, and indeed, we exhibit several positive results that do not hold in the traditional setting.
We develop Learning to Reason algorithms for classes of theories for which no efficient reasoning algorithm exists when represented as a traditional (formula-based) knowledge base. We also exhibit Learning to Reason algorithms for a class of theories that is not known to be learnable in the traditional sense. Many of our results rely on the theory of model-based representations that we develop in this thesis. In this representation, the knowledge base is represented as a set of models (satisfying assignments) rather than a logical formula. We show that in many cases reasoning with a model-based representation is more efficient than reasoning with a formula-based representation and, more significantly, that it suggests a new view of reasoning, and in particular, of logical reasoning. \par In the final part of this thesis, we address another fundamental criticism of the knowledge-based system approach. We suggest a new approach for the study of the non-monotonicity of human commonsense reasoning, within the Learning to Reason framework. The theory developed is shown to support efficient reasoning with incomplete information, and to avoid many of the representational problems which existing default reasoning formalisms face. \par We show how the various reasoning tasks we discuss in this thesis relate to each other and conclude that they are all supported together naturally. %X tr-02-95.ps.gz %R TR-03-95 %D 1995 %T Translating between Horn Representations and their Characteristic Models %A Roni Khardon %X Characteristic models are an alternative, model-based representation for Horn expressions. It has been shown that these two representations are incomparable and each has its advantages over the other. It is therefore natural to ask what is the cost of translating, back and forth, between these representations. Interestingly, the same translation questions arise in database theory, where they have applications to the design of relational databases.
\par We study the complexity of these problems and prove some positive and negative results. Our main result is that the two translation problems are equivalent under polynomial reductions, and that they are equivalent to the corresponding decision problem. Namely, translating is equivalent to deciding whether a given set of models is the set of characteristic models for a given Horn expression. \par We also relate these problems to translating between the CNF and DNF representations of monotone functions, a well-known problem for which no polynomial time algorithm is known. It is shown that in general our translation problems are at least as hard as the latter, and in a special case they are equivalent to it. %X tr-03-95.ps.gz %R TR-04-95 %D 1995 %T Volume of a Hyper-Parallelepiped after Affine Transformations, and its Application to Optimal Parallel Loop Execution %A Yan-Zhong Ding %A Dan Stefanescu %X This paper presents a theoretical framework for the efficient scheduling of a class of parallel loop nests on distributed memory parallel computers. The method generates two classes of schedules, evaluates them according to a full-fledged cost model and then selects the best option. The cost model used is the Bulk Synchronous Parallel model. The method can generate schedules whose efficiency is tailored to any parallel architecture and any parameters characterizing the parallel loops. As an application, we generate optimal schedules for the matrix-matrix multiplication problem for general matrices, thus extending previous results for square matrices. This is an example of a compiler optimization for transportable parallel software. %R TR-05-95 %D 1995 %T A Proposed New Memory Manager %A Robert L. Walton %X Memory managers should support compactification, multiple simultaneous garbage collections, and ephemeral collections in a realtime multi-processor shared memory environment.
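The characteristic-model representation discussed for TR-03-95 above can be made concrete with a small sketch. It rests on the classical fact that a propositional theory is Horn exactly when its model set is closed under componentwise intersection (bitwise AND); the characteristic models are the intersection-irreducible models, and closing them under intersection recovers the full model set. A brute-force sketch, suitable only for tiny variable counts (the example Horn expression is invented for illustration):

```python
from itertools import combinations

def closure(models):
    """Close a set of bit-vector models (tuples of 0/1) under
    componentwise intersection."""
    M = set(models)
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(M), 2):
            c = tuple(x & y for x, y in zip(a, b))
            if c not in M:
                M.add(c)
                changed = True
    return M

def characteristic(models):
    """Models not recoverable as intersections of the others; they
    generate the whole model set under closure."""
    M = closure(models)
    return {m for m in M if m not in closure(M - {m})}

# Models of the Horn expression (x1 -> x2) over variables (x1, x2, x3):
M = {(0,0,0), (0,0,1), (0,1,0), (0,1,1), (1,1,0), (1,1,1)}
assert closure(M) == M       # Horn: model set already intersection-closed
C = characteristic(M)
assert closure(C) == M       # the characteristic models regenerate M
```

The translation problems studied in the report ask how expensive it is to move between such a model set and a Horn formula, in both directions.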
They should permit old addresses of an object to be invalidated without significant delay, and should permit array accesses with no per-element inefficiency. \par A new approach to building an optimal standard solution to these requirements is presented for stock hardware and next generation languages. If such an approach should become a standard, this would spur the development of standard hardware to optimize away the overhead. %X tr-05-95.ps.gz %R TR-06-95 %D 1995 %T A Comparative Analysis of Schemes for Correlated Branch Prediction %A Cliff Young %A Nicolas Gloy %A Michael D. Smith %X Modern high-performance architectures require extremely accurate branch prediction to overcome the performance limitations of conditional branches. We present a framework that categorizes branch prediction schemes by the way in which they partition dynamic branches and by the kind of predictor that they use. The framework allows us to compare and contrast branch prediction schemes, and to analyze why they work. We use the framework to show how a static correlated branch prediction scheme increases branch bias and thus improves overall branch prediction accuracy. We also use the framework to identify the fundamental differences between static and dynamic correlated branch prediction schemes. This study shows that there is room to improve the prediction accuracy of existing branch prediction schemes. %X tr-06-95.ps.gz %R TR-07-95 %D 1995 %T Efficient Learning of Real Time One-Counter Automata %A Amr Fahmy %A Robert Roos %X We present an efficient learning algorithm for languages accepted by deterministic real time one counter automata (ROCA). The learning algorithm works by first learning an initial segment, $B_n$, of the infinite state machine that accepts the unknown language and then decomposing it into a complete control structure and a partial counter. A new efficient ROCA decomposition algorithm, which will be presented in detail, allows this result.
The decomposition algorithm works in $O(n^2\log(n))$, where $n^c$ is the number of states of $B_n$ for some constant $c$. \par If Angluin's algorithm for learning regular languages is used to learn $B_n$ and the complexity of this step is $h(n,m)$ where $m$ is the length of the longest counterexample necessary for Angluin's algorithm, the complexity of our algorithm is thus $O(h(n,m) + n^2\log(n))$. %X tr-07-95.ps.gz %R TR-08-95 %D 1995 %T ROMM Routing on Mesh and Torus Networks %A Ted Nesson %A S. Lennart Johnsson %X ROMM is a class of Randomized, Oblivious, Multi--phase, Minimal routing algorithms. ROMM routing offers a potential for improved performance compared to both fully randomized algorithms and deterministic oblivious algorithms, under both light and heavy loads. ROMM routing also offers close to best case performance for many common routing problems. In previous work, these claims were supported by extensive simulations on binary cube networks. Here we present analytical and empirical results for ROMM routing on wormhole routed mesh and torus networks. Our simulations show that ROMM algorithms can perform several representative routing tasks 1.5 to 3 times faster than fully randomized algorithms, for medium--sized networks. Furthermore, ROMM algorithms are always competitive with deterministic, oblivious routing, and in some cases, up to 2 times faster. %X tr-08-95.ps.gz %R TR-09-95 %D 1995 %T The Impact of Operating System Structure on Personal Computer Performance %A J. Bradley Chen %A Yasuhiro Endo %A Kee Chan %A David Mazieres %A Antonio Dias %A Margo Seltzer %A Michael Smith %X This paper presents a comparative study of the performance of three operating systems that run on the personal computer architecture derived from the IBM-PC.
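One concrete reading of the two-phase case of ROMM routing on the binary cube (described for TR-08-95 above): pick a random intermediate node inside the subcube spanned by the bits where source and destination differ, then route minimally to it and from it to the destination. Every such route remains minimal, since the two legs flip disjoint subsets of the differing bits. A sketch of this construction (an illustration of the idea, not the reports' implementation):

```python
import random

def hamming_path(src, dst):
    """One minimal (dimension-ordered) path on the binary cube:
    flip the differing bits of src one at a time, low dimension first."""
    path, cur, diff, d = [src], src, src ^ dst, 0
    while diff:
        if diff & 1:
            cur ^= 1 << d
            path.append(cur)
        diff >>= 1
        d += 1
    return path

def romm_route(src, dst, rng):
    """Two-phase ROMM sketch: random intermediate node inside the
    subcube spanned by the bits where src and dst differ, then a
    minimal route src -> mid -> dst."""
    diff = src ^ dst
    mask = diff & rng.getrandbits(max(diff.bit_length(), 1))
    mid = src ^ mask          # flips a random subset of differing bits
    return hamming_path(src, mid) + hamming_path(mid, dst)[1:]

rng = random.Random(1)
src, dst = 0b0000, 0b1011
path = romm_route(src, dst, rng)
# Minimality: hop count equals the Hamming distance, whatever mid was chosen.
assert len(path) - 1 == bin(src ^ dst).count("1")
```

Randomizing the intermediate node spreads traffic across minimal paths, which is the source of ROMM's claimed advantage over purely deterministic oblivious routing.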
The operating systems, Windows for Workgroups (tm), Windows NT (tm), and NetBSD (a freely available UNIX (tm) variant) cover a broad range of system functionality and user requirements, from a single address space model to full protection with preemptive multitasking. Our measurements were enabled by hardware counters in Intel's Pentium (tm) processor that permit measurement of a broad range of processor events including instruction counts and on-chip cache miss rates. We used both microbenchmarks, which expose specific differences between the systems, and application workloads, which provide an indication of expected end-to-end performance. Our microbenchmark results show that accessing system functionality is more expensive in Windows than in the other two systems due to frequent changes in machine mode and the use of system call hooks. When running native applications, Windows NT is more efficient than Windows, but it does incur overhead from its microkernel structure. Overall, system functionality can be accessed most efficiently in NetBSD; we attribute this to its monolithic structure, and to the absence of the complications created by backwards compatibility in the other systems. Measurements of application performance show that the impact of these differences is significant in terms of overall execution time. %X tr-09-95.ps.gz %R TR-10-95 %D 1995 %T Managing Design Complexity: Using Stochastic Optimization in the Production of Computer Graphics %A Jon Christensen %X This thesis examines the automated design of computer graphics. We present a methodology that emphasizes optimization, problem representation, stochastic search, and empirical analysis. Two problems are considered, which together encompass and exemplify both 2D and 3D graphics production: label placement and motion synthesis for animation.
\par Label placement is the problem of annotating various informational graphics with textual labels, subject to constraints that respect proper label-feature associativity, label and feature obscuration, and aesthetically desirable label positions. Examples of label placement include applying textual labels to a geographic map, or item tags to a scatterplot. Motion synthesis is the problem of composing a visually plausible motion for an animated character, subject to animator-imposed constraints on the form and characteristics of the desired motion. \par For each problem we propose new solution methods that utilize efficient problem representations combined with stochastic optimization techniques. We demonstrate that these methods offer significant advantages over competing solutions in terms of ease-of-use, visual quality, and computational efficiency. Taken together, these results also demonstrate an effective approach for continued progress in automating graphical design, which should be applicable to a wide range of graphical design applications beyond the two considered here. %X tr-10-95.ps.gz %X tr-10-95-p70.ps.gz %X tr-10-95-p71.ps.gz %R TR-11-95 %D 1995 %T Interpreting Cohesive Forms in the Context of Discourse Inference %A Andrew Kehler %X In this thesis, we present analyses and algorithms for resolving a variety of cohesive phenomena in natural language, including VP-ellipsis, gapping, event reference, tense, and pronominal reference. Past work has attempted to explain the complicated behavior of these expressions with theories that operate within a single module of language processing. We argue that such approaches cannot be maintained; in particular, the data we present strongly suggest that the nature of the coherence relation operative between clauses needs to be taken into account. \par We provide a theory of coherence relations and the discourse inference processes that underlie their recognition.
We utilize this theory to break the deadlock between syntactic and semantic approaches to resolving VP-ellipsis. We show that the data exhibits a pattern with respect to our categorization of coherence relations, and present an account which predicts this pattern. We extend our analysis to gapping and event reference, and show that our analyses result in a more independently-motivated and empirically-adequate distinction among types of anaphoric processes than past analyses. \par We also present an account of VP-ellipsis resolution that predicts the correct set of `strict' and `sloppy' readings for a number of benchmark examples that are problematic for past approaches. The correct readings can be seen to result from a general distinction between `referring' and `copying' in anaphoric processes. The account also extends to other types of reference, such as event reference and `one'-anaphora. \par Finally, we utilize our theory of coherence in analyses that break the deadlock between definite-reference and coherence-based approaches to tense and pronoun interpretation. We present a theory of tense interpretation that interacts with discourse inference processes to predict data that is problematic for both types of approach. We demonstrate that the data commonly cited in the pronoun interpretation literature also exhibits a pattern with respect to coherence relations, and make some preliminary proposals for how such a pattern might result from the properties of the different types of discourse inference we posit. %X tr-11-95.ps.gz %R TR-12-95 %D 1995 %T Containment Algorithms for Nonconvex Polygons with Applications to Layout %A Karen McIntosh Daniels %X Layout and packing are NP-hard geometric optimization problems which appear in a variety of manufacturing industries. At their core, layout and packing problems have the common geometric feasibility problem of {\em containment}: find a way of placing a set of items into a container. 
We focus on containment and its applications to layout and packing problems. We demonstrate that, although containment is NP-hard, it is fruitful to: 1) develop algorithms for containment, as opposed to heuristics, 2) design containment algorithms so that they say ``no'' almost as fast as they say ``yes'', 3) use geometric techniques, not just mathematical programming techniques, and 4) maximize the number of items for which the algorithms are practical. \par Our approach to containment is based on a new restrict/evaluate/subdivide paradigm. We develop theory and practical techniques for the operations within the paradigm. The techniques are appropriate for two-dimensional containment problems in which the items and container may be irregular polygons, and in which the items may be translated, but not rotated. Our techniques can be combined to form a variety of two-dimensional translational containment algorithms. The paradigm is designed so that, unlike existing iteration-based algorithms, containment algorithms based on the paradigm are adept at saying ``no'', even for slightly infeasible problems. We present two algorithms based on our paradigm. We obtain the first practical running times for NP-complete two-dimensional translational containment problems for up to ten nonconvex items in a nonconvex container. \par We demonstrate that viewing containment as a feasibility problem has many benefits for packing and layout problems. For example, we present an effective method for finding minimal enclosures which uses containment to perform binary search on a parameter. Compaction techniques can accelerate the search. We also use containment to develop the first practical pre-packing strategy for a multi-stage pattern layout problem in apparel manufacturing. Pre-packing is a layout method which packs items into a collection of containers by first generating groups of items which fit into each container and then assigning groups to containers. 
%X tr-12-95.ps.gz %R TR-13-95 %D 1995 %T Probabilistic Cache Replacement %A J. Bradley Chen %X Modern microprocessors tend to use on-chip caches that are much smaller than the working set size of many interesting computations. In such situations, cache performance can be improved through selective caching: the use of cache replacement policies in which data fetched from memory, although forwarded to the CPU, is not necessarily loaded into the cache. This paper introduces a selective caching policy called Probabilistic Cache Replacement (PCR) in which caching of data fetched from main memory is determined by a probabilistic boolean-valued function. Use of PCR creates a self-selection mechanism in which repeated misses to a word in memory increase its probability of being loaded into the cache. A PCR cache gives better reductions in instruction cache miss rate than a comparable cache configuration with a victim cache. Instruction cache miss rates can be reduced by up to 30\% for some of the SPECmarks, although the optimal probability distribution is workload dependent. This paper also presents a mechanism called Feedback PCR which dynamically selects probability values for a PCR cache. For a 16 Kbyte direct-mapped instruction cache, Feedback PCR with a one-entry MFB gives an average reduction in cache misses of over 11\% across the SPECmarks with no significant increase in cache misses for any of the workloads, and compares favorably with other alternatives of similar hardware cost. %X tr-13-95.ps.gz %R TR-14-95 %D 1995 %T MOSS: A Mobile Operating Systems Substrate %A J. Bradley Chen %A H.T. Kung %A Margo Seltzer %X The Mobile Operating System Substrate (MOSS) is a new system architecture for wireless mobile computing being developed at Harvard. MOSS provides highly efficient, robust and flexible virtual device access over wireless media.
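The PCR policy described for TR-13-95 above can be illustrated with a toy direct-mapped cache in which a missing block is always forwarded to the CPU but installed in the cache only with probability p, so repeatedly missed blocks eventually win residency. A hedged sketch (the set mapping, reference stream, and parameters here are invented for illustration, not taken from the report):

```python
import random

def simulate(refs, n_sets, p, seed=0):
    """Direct-mapped cache with probabilistic insertion (PCR sketch):
    a missing block is loaded into its set only with probability p.
    Returns the miss count for the given block-reference stream."""
    rng = random.Random(seed)
    cache = [None] * n_sets
    misses = 0
    for block in refs:
        s = block % n_sets
        if cache[s] != block:
            misses += 1                # data still forwarded to the CPU...
            if rng.random() < p:       # ...but cached only probabilistically
                cache[s] = block
    return misses

# Two blocks that conflict in a 4-set cache, followed by a loop over block 0.
refs = [0, 4, 0, 4] + [0] * 20
# With p = 1.0 this is an ordinary direct-mapped cache: the conflicting
# pair thrashes (4 misses), then block 0 misses once and hits thereafter.
assert simulate(refs, n_sets=4, p=1.0) == 5
```

With p < 1, a block that conflicts only briefly is less likely to evict a block that misses repeatedly, which is the self-selection effect the abstract describes; the benefit is workload-dependent, as the report notes.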
MOSS services provide mobile access to such resources as disks, CD ROM drives, displays, wired network interfaces, and audio and video devices. MOSS services are composed of virtual circuits and virtual devices. Virtual circuits (VCs) on wireless media support the spectrum of quality-of-service (QoS) levels required to cover a broad range of application requirements. Virtual devices implement resource access using VCs as their communication substrate. The tight coupling of network code and device implementations makes it possible to apply device-specific semantics to communications resource management problems. MOSS will enable mobile software systems to adapt dynamically to the rapidly changing computing and communications environment created by mobility. %X tr-14-95.ps.gz %R TR-15-95 %D 1995 %T Reasoning with Examples: Propositional Formulae and Database Dependencies %A Roni Khardon %A Heikki Mannila %A Dan Roth %X For humans, looking at how concrete examples behave is an intuitive way of deriving conclusions. The drawback with this method is that it does not necessarily give the correct results. However, under certain conditions example-based deduction can be used to obtain a correct and complete inference procedure. This is the case for Boolean formulae (reasoning with models) and for certain types of database integrity constraints (the use of Armstrong relations). We show that these approaches are closely related, and use the relationship to prove new results about the existence and sizes of Armstrong relations for Boolean dependencies. Further, we study the problem of translating between different representations of relational databases, in particular we consider Armstrong relations and Boolean dependencies, and prove some positive results in that context. Finally, we discuss the close relations between the questions of finding keys in relational databases and that of finding abductive explanations. 
%X tr-15-95.ps.gz %R TR-16-95 %D 1995 %T The Case for Extensible Operating Systems %A Margo Seltzer %A Keith Smith %A Christopher Small %X Many of the performance improvements cited in recent operating systems research describe specific enhancements to normal operating system functionality that improve performance in a set of designated test cases. Global changes of this sort can improve performance for one application, at the cost of decreasing performance for others. We argue that this flurry of global kernel tweaking is an indication that our current operating system model is inappropriate. Existing interfaces do not provide the flexibility to tune the kernel on a per-application basis, to suit the variety of applications that we now see. \par We have failed in the past to be omniscient about future operating system requirements; there is no reason to believe that we will fare any better designing a new, fixed kernel interface today. Instead, the only general-purpose solution is to build an operating system interface that is easily extendable. We present a kernel framework designed to support the application-specific customization that is beginning to dominate the operating system literature. We show how this model enables easy implementation of many of the earlier research results. We then analyze two specific kernel policies: page read-ahead and lock-granting. We show that application-control over read-ahead policy produces performance improvements of up to 16\%. We then show how application-control over the lock-granting policy can choose between fairness and response time. Reader priority algorithms produce lower read response time at the cost of writer starvation. FIFO algorithms avoid the starvation problem, but increase read response time. 
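The fairness/response-time tradeoff between the two lock-granting policies discussed above can be shown with a deliberately reductive toy (not the kernel's implementation): given a static queue of waiting requests, reader priority reorders all readers ahead of writers, while FIFO preserves arrival order.

```python
def grant_order(waiting, policy):
    """Toy model of a lock-grant queue: given waiting requests ('R' or 'W')
    in arrival order, return the order in which they would be granted.

    'fifo' grants strictly in arrival order (no starvation, but readers
    queue behind writers, raising read response time); 'reader_priority'
    grants every waiting reader before any writer (low read response
    time, but a steady reader stream can starve writers)."""
    if policy == "fifo":
        return list(waiting)
    readers = [r for r in waiting if r == "R"]
    writers = [w for w in waiting if w == "W"]
    return readers + writers

queue = ["R", "W", "R", "R", "W", "R"]
print(grant_order(queue, "fifo"))             # ['R', 'W', 'R', 'R', 'W', 'R']
print(grant_order(queue, "reader_priority"))  # ['R', 'R', 'R', 'R', 'W', 'W']
```

In the reader-priority order both writers land at the tail of the queue, which is exactly the starvation behavior the abstract trades against FIFO's higher read response time.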
%X tr-16-95.ps.gz %R TR-17-95 %D 1995 %T Autonomous Replication in Wide-Area Internetworks %A James Gwertzman %X The number of users connected to the Internet has been growing at an exponential rate, resulting in similar increases in network traffic and Internet server load. Advances in microprocessors and network technologies have kept up with growth so far, but we are reaching the limits of hardware solutions. In order for the Internet's growth to continue, we must efficiently distribute server load and reduce the network traffic generated by its various services. \par Traditional wide-area caching schemes are client initiated. Decisions on where and when to cache information are made without the benefit of the server's global knowledge of the situation. We introduce a technique, push-caching, that is server initiated; it leaves caching decisions to the server. The server uses its knowledge of network topology, geography, and access patterns to minimize network traffic and server load. \par The World Wide Web is an example of a large-scale distributed information system that will benefit from this geographical distribution, and we present an architecture that allows a Web server to autonomously replicate Web files. We use a trace-driven simulation of the Internet to evaluate several competing caching strategies. Our results show that while simple client caching reduces server load and network bandwidth demands by up to 30\%, adding server-initiated caching reduces server load by an additional 20\% and network bandwidth demands by an additional 10\%. Furthermore, push-caching is more efficient than client-caching, using an order of magnitude less cache space for comparable bandwidth and load savings. 
\par To determine the optimal cache consistency protocol we used a generic server simulator to evaluate several cache-consistency protocols, and found that weak consistency protocols are sufficient for the World Wide Web since they use the same bandwidth as an atomic protocol, impose less server load, and return stale data less than 1\% of the time. %X tr-17-95.ps.gz %R TR-18-95 %D 1995 %T Centering: A Framework for Modelling the Local Coherence of Discourse %A Barbara J. Grosz %A Aravind K. Joshi %A Scott Weinstein %X tr-18-95.ps.gz %R TR-19-95 %D 1995 %T Benchmarking Filesystems %A Diane L. Tang %X One of the most widely researched areas in operating systems is filesystem design, implementation, and performance. Almost all of the research involves reporting performance numbers gathered from a variety of different benchmarks. The problem with such results is that existing filesystem benchmarks are inadequate, suffering from problems ranging from not scaling with advancing technology to not measuring the filesystem. \par A new approach to filesystem benchmarking is presented here. This methodology is designed both to help system designers understand and improve existing systems and to help users decide which filesystem to buy or run. For usability, the benchmark is separated into two parts: a suite of micro-benchmarks, which is actually run on the filesystem, and a workload characterizer. The results from the two separate parts can be combined to predict the performance of the filesystem on the workload. \par The purpose for this separation of functionality is two-fold. First, many system designers would like their filesystem to perform well under diverse workloads: by characterizing the workload independently, the designers can better understand what is required of the filesystem. 
The micro-benchmarks tell the designer what needs to be improved while the workload characterizer tells the designer whether that improvement will affect filesystem performance under that workload. This separation also helps users trying to decide which system to run or buy, who may not be able to run their workload on all systems under consideration. \par The implementation of this methodology does not suffer from many of the problems seen in existing benchmarks: it scales with technology, it is tightly specified, and it helps system designers. This benchmark's only drawbacks are that it does not accurately predict the performance of a filesystem on a workload, thus limiting its applicability: it is useful to system designers, but not for users trying to decide which system to buy. We believe that the general approach will work, given additional time to refine the prediction algorithm. %X tr-19-95.ps.gz %R TR-20-95 %D 1995 %T Collaborative Plans for Complex Group Action %A Barbara J. Grosz %A Sarit Kraus %X The original formulation of SharedPlans was developed to provide a model of collaborative planning in which it was not necessary for one agent to have intentions-to toward an act of a different agent. Unlike other contemporaneous approaches, this formulation provided for two agents to coordinate their activities without introducing any notion of irreducible joint intentions. However, it only treated activities that directly decomposed into single-agent actions, did not address the need for agents to commit to their joint activity, and did not adequately deal with agents having only partial knowledge of the way in which to perform an action. This paper provides a revised and expanded version of SharedPlans that addresses these shortcomings.
It also reformulates Pollack's definition of individual plans to handle cases in which a single agent has only partial knowledge; this reformulation meshes with the definition of SharedPlans. The new definitions also allow for contracting out certain actions. The formalization that results has the features required by Bratman's account of shared cooperative activity and is more general than alternative accounts. %X tr-20-95.ps.gz %R TR-21-95 %D 1995 %T Instructions for Annotating Discourse %A Christine H. Nakatani %A Barbara J. Grosz %A David D. Ahn %A Julia Hirschberg %X tr-21-95.ps.gz %R TR-22-95 %D 1995 %T Finding the Largest Rectangle in Several Classes of Polygons %A Karen Daniels %A Victor J. Milenkovic %A Dan Roth %X This paper considers the geometric optimization problem of finding the Largest area axis-parallel Rectangle (LR) in an $n$-vertex general polygon. We characterize the LR for general polygons by considering different cases based on the types of contacts between the rectangle and the polygon. A general framework is presented for solving a key subproblem of the LR problem which dominates the running time for a variety of polygon types. This framework permits us to transform an algorithm for orthogonal polygons into an algorithm for nonorthogonal polygons. Using this framework, we obtain the following LR time results: $\Theta(n)$ for $xy$-monotone polygons, ${\rm O}(n \alpha(n))$ for orthogonally convex polygons, (where $\alpha(n)$ is the slowly growing inverse of Ackermann's function), ${\rm O}(n \alpha(n) \log n)$ for horizontally (vertically) convex polygons, ${\rm O}(n \log n)$ for a special type of horizontally convex polygon (whose boundary consists of two $y$-monotone chains on opposite sides of a vertical line), and ${\rm O}(n \log^2 n)$ for general polygons (allowing holes). For all these types of non-orthogonal polygons, we match the running time of the best known algorithms for their orthogonal counterparts. 
A lower bound of $\Omega(n \log n)$ time is established for finding the LR in both self-intersecting polygons and general polygons with holes. The latter result gives us both a lower bound of $\Omega(n \log n)$ and an upper bound of ${\rm O}(n \log^2 n)$ for general polygons. %X tr-22-95.ps.gz %R TR-23-95 %D 1995 %T Performance Issues in Correlated Branch Prediction Schemes %A Nicolas Gloy %A Michael D. Smith %A Cliff Young %X Accurate static branch prediction is the key to many techniques for exposing, enhancing, and exploiting Instruction Level Parallelism (ILP). The initial work on static correlated branch prediction (SCBP) demonstrated improvements in branch prediction accuracy, but did not address overall performance. In particular, SCBP expands the size of executable programs, which negatively affects the performance of the instruction memory hierarchy. Using the profile information available under SCBP, we can minimize these negative performance effects through the application of code layout and branch alignment techniques. We evaluate the performance effect of SCBP and these profile-driven optimizations on instruction cache misses, branch mispredictions, and branch misfetches for a number of recent processor implementations. We find that SCBP improves performance over (traditional) per-branch static profile prediction. We also find that SCBP improves the performance benefits gained from branch alignment. As expected, SCBP gives larger benefits on machine organizations with high mispredict/misfetch penalties and low cache miss penalties. Finally, we find that the application of profile-driven code layout and branch alignment techniques (without SCBP) can improve the performance of the dynamic correlated branch prediction techniques.
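Why correlation helps branch prediction can be sketched with a small simulation (SCBP itself is a static, profile-driven technique; the sketch below uses a dynamic gshare-style predictor purely to illustrate the effect, with made-up table sizes):

```python
def run_predictor(trace, use_history, table_bits=10, hist_bits=4):
    """Simulate a table of 2-bit saturating counters.
    use_history=False: index by PC only (per-branch prediction).
    use_history=True:  index by PC XOR global history (correlated)."""
    table = [1] * (1 << table_bits)   # init: weakly not-taken
    mask = (1 << table_bits) - 1
    hist, misses = 0, 0
    for pc, taken in trace:
        idx = (pc ^ hist) & mask if use_history else pc & mask
        if (table[idx] >= 2) != taken:          # predict taken iff counter >= 2
            misses += 1
        table[idx] = min(3, table[idx] + 1) if taken else max(0, table[idx] - 1)
        hist = ((hist << 1) | taken) & ((1 << hist_bits) - 1)
    return misses / len(trace)

# A single branch whose outcome alternates: perfectly history-correlated.
trace = [(0x40, t) for t in [True, False] * 500]
print(run_predictor(trace, use_history=False))  # per-branch counter thrashes
print(run_predictor(trace, use_history=True))   # near 0 after warm-up
```

On the alternating pattern the per-branch counter oscillates between states 1 and 2 and mispredicts every branch, while the history-indexed table learns two separate entries (one per history value) and becomes nearly perfect, which is the correlation effect SCBP exploits statically.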
%X tr-23-95.ps.gz %R TR-24-95 %D 1995 %T Randomized, Oblivious, Minimal Routing Algorithms for Multicomputers %A Ted Nesson %X Efficient data motion has been critical in high performance computing for as long as computers have been in existence. Massively parallel computers use a sparse interconnection network between processing nodes with local memories. Minimizing the potential for high congestion of communication links is an important goal in the design of routing algorithms and interconnection networks in these systems. \par In these distributed--memory architectures, the communication system represents a significant portion of the total system cost, but is nevertheless often a weak link in the system with respect to performance. Efficient interprocessor communication is one of the most important and most challenging problems associated with massively parallel computing. Communication delays can easily represent a large fraction of the total running time, inhibiting high performance computing for a wide range of problems. Efficient use of the communication system is the focus of this thesis. \par The design of the interconnection network and the routing algorithms used to transport data between nodes are critical to overall system performance. The constraints imposed by a sparse interconnection network suggest that preserving locality of reference through careful data allocation and minimizing network load by using minimal algorithms are desirable objectives. \par In this thesis, we present ROMM, a new class of general--purpose message routing algorithms for large--scale, distributed--memory multicomputers. ROMM is a class of Randomized, Oblivious, Multi--phase, Minimal routing algorithms. We will show that ROMM routing offers the potential for improved performance compared to both fully randomized algorithms and deterministic oblivious algorithms, under both light and heavy loads. 
ROMM routing also offers close to best--case performance for many common routing tasks. These claims are supported by extensive analysis and simulation of ROMM routing on several different interconnection network architectures, for a set of representative routing tasks. Furthermore, our results show that non--minimality and adaptivity, two common techniques for reducing congestion, are not always required for good routing performance. %X tr-24-95.ps.gz %R TR-25-95 %D 1995 %T Translational Polygon Containment and Minimal Enclosure using Geometric Algorithms and Mathematical Programming %A Victor J. Milenkovic %A Karen M. Daniels %X We present an algorithm for the two-dimensional translational {\em containment} problem: find translations for $k$ polygons (with up to $m$ vertices each) which place them inside a polygonal container (with $n$ vertices) without overlapping. The polygons and container may be nonconvex. The containment algorithm consists of new algorithms for {\em restriction}, {\em evaluation}, and {\em subdivision} of two-dimensional configuration spaces. The restriction and evaluation algorithms both depend heavily on linear programming; hence we call our algorithm an {\em LP containment algorithm}. Our LP containment algorithm is distinguished from previous containment algorithms by the way in which it applies principles of mathematical programming and also by its tight coupling of the evaluation and subdivision algorithms. Our new evaluation algorithm finds a local overlap minimum. Our distance-based subdivision algorithm eliminates a ``false'' (local but not global) overlap minimum and all layouts near that overlap minimum, allowing the algorithm to make progress towards the global overlap minimum with each subdivision. \par In our experiments on data sets from the apparel industry, our LP algorithm can solve containment for up to ten polygons in a few minutes on a desktop workstation.
Its practical running time is better than that of our previous containment algorithms and we believe it to be superior to all previous translational containment algorithms. Its theoretical running time, however, depends on the number of local minima visited, which is $\bigo((6kmn+k^2m^2)^{2k+1}/k!)$. To obtain a better theoretical running time, we present a modified (combinatorial) version of LP containment with a running time of \[ \bigo\left(\frac{(6kmn+k^2m^2)^{2k}}{(k-5)!} \log kmn \right), \] which is better than any previous combinatorial containment algorithm. For constant $k$, it is within a factor of $\log mn$ of the lower bound. \par We generalize our configuration space containment approach to solve {\em minimal enclosure} problems. We give algorithms to find the minimal enclosing square and the minimal area enclosing rectangle for $k$ translating polygons. Our LP containment algorithm and our minimal enclosure algorithms succeed by combining rather than replacing geometric techniques with linear programming. This demonstrates the manner in which linear programming can greatly increase the power of geometric algorithms. %X tr-25-95.ps.gz %R TR-26-95 %D 1995 %T Kernel Instrumentation Tools and Techniques %A J. Bradley Chen %A Alan Eustace %X Atom is a powerful platform for the implementation of profiling, debugging and simulation tools. Kernel support in ATOM makes it possible to implement similar tools for the Digital UNIX kernel. We describe four non-trivial Atom kernel tools which demonstrate the support provided in Atom for kernel work as well as the range of application of Atom kernel tools. We go on to discuss some techniques that are generally useful when using Atom with the kernel. Prior techniques restrict kernel measurements to the domain of exotic systems research. We hope Atom technology will make kernel instrumentation and measurement practical for a much larger community of researchers.
%X tr-26-95.ps.gz %R TR-27-95 %D 1995 %T On the Transportation and Distribution of Data Structures in Parallel and Distributed Systems %A Amr F. Fahmy %A Robert A. Wagner %X We present algorithms for the transportation of data in parallel and distributed systems that would enable programmers to transport or distribute a data structure by issuing a function call. Such a functionality is needed if programming distributed memory systems is to become commonplace. \par The distribution problem is defined as follows. We assume that $n$ records of a data structure are scattered among $p$ processors where processor $q_i$ holds $r_{i}$ records, $1 \leq i \leq p$. The problem is to redistribute the records so that each processor holds $\lfloor n/p \rfloor$ records. We solve the problem in the minimum number of parallel data-permutation operations possible, for the given initial record distribution. This means that we use $max( mxr - \lfloor n/p \rfloor, \lfloor n/p \rfloor - mnr )$ parallel data transfer steps, where $mxr = max(r_{i})$ and $ mnr = min(r_{i})$ for $1 \leq i \leq p$. \par Having solved the distribution problem, it then remains to transport the data structure from the memory of one processor to another. In the case of dynamically allocated data structures, we solve the problem of renaming pointers by creating an intermediate name space. We also present a transportation algorithm that attempts to hide the cost of making a local copy of the data structure, which is necessary since the data structure could be scattered in the memory of the sender. %X tr-27-95.ps.gz %R TR-28-95 %D 1995 %T Learning to take Actions %A Roni Khardon %X We formalize a model for supervised learning of action strategies in dynamic stochastic domains and show that PAC-learning results on Occam algorithms hold in this model as well. We then identify a class of rule based action strategies for which polynomial time learning is possible.
The representation of strategies is a generalization of decision lists; strategies include rules with existentially quantified conditions, simple recursive predicates, and small internal state, but are syntactically restricted. We also study the learnability of hierarchically composed strategies where a subroutine already acquired can be used as a basic action in a higher level strategy. We prove some positive results in this setting, but also show that in some cases the hierarchical learning problem is computationally hard. %X tr-28-95.ps.gz %R TR-29-95 %D 1995 %T Network Related Performance Issues and Techniques for MPPs %A S. Lennart Johnsson %X In this paper we review network related performance issues for current Massively Parallel Processors (MPPs) in the context of some important basic operations in scientific and engineering computation. The communication system is one of the most performance critical architectural components of MPPs. In particular, understanding the demand posed by collective communication is critical in architectural design and system software implementation. We discuss collective communication and some implementation techniques for it on electronic networks. Finally, we give an example of a novel general routing technique that exhibits good scalability, efficiency and simplicity in electronic networks. %X tr-29-95.ps.gz %R TR-30-95 %D 1995 %T Efficient Learning from Faulty Data %A Scott Evan Decatur %X Learning systems are often provided with imperfect or noisy data. Therefore, researchers have formalized various models of learning with noisy data, and have attempted to delineate the boundaries of learnability in these models. In this thesis, we describe a general framework for the construction of efficient learning algorithms in noise tolerant variants of Valiant's PAC learning model. By applying this framework, we also obtain many new results for specific learning problems in various settings with faulty data.

The central tool used in this thesis is the specification of learning algorithms in Kearns' Statistical Query (SQ) learning model, in which statistics, as opposed to labelled examples, are requested by the learner. These SQ learning algorithms are then converted into PAC algorithms which tolerate various types of faulty data.

We develop this framework in three major parts:

- We design automatic compilations of SQ algorithms into PAC algorithms which tolerate various types of data errors. These results include improvements to Kearns' classification noise compilation, and the first such compilations for malicious errors, attribute noise and new classes of ``hybrid'' noise composed of multiple noise types.
- We prove nearly tight bounds on the required complexity of SQ algorithms. The upper bounds are based on a constructive technique which allows one to achieve this complexity even when it is not initially achieved by a given SQ algorithm.
- We define and employ an improved model of SQ learning which yields noise tolerant PAC algorithms that are more efficient than those derived from standard SQ algorithms.
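The framework above can be sketched in the simplest special case (an illustration, not the thesis's general compilation): an SQ query asking for the agreement rate between a hypothesis and the target can be estimated from classification-noise examples and then corrected, because noise at rate eta shifts the true statistic p to (1-eta)p + eta(1-p), which is invertible for eta != 1/2. The target concept, noise rate, and sample size below are all made up for the demonstration.

```python
import random

random.seed(0)

# Hypothetical target concept over 4-bit examples: parity of the first two bits.
def f(x):
    return x[0] ^ x[1]

def noisy_sample(m, eta):
    """Draw m examples under classification noise: each label flips w.p. eta."""
    out = []
    for _ in range(m):
        x = tuple(random.randint(0, 1) for _ in range(4))
        y = f(x) if random.random() >= eta else 1 - f(x)
        out.append((x, y))
    return out

def sq_agreement(h, examples, eta):
    """Estimate the SQ statistic P[h(x) = f(x)] from noisy data: measure the
    noisy agreement rate, then invert the noise shift (requires eta != 1/2)."""
    p_noisy = sum(1 for x, y in examples if h(x) == y) / len(examples)
    return (p_noisy - eta) / (1 - 2 * eta)

eta = 0.2
exs = noisy_sample(20000, eta)
print(sq_agreement(f, exs, eta))             # ~1.0: h equals the target
print(sq_agreement(lambda x: 0, exs, eta))   # ~0.5: constant-0 hypothesis
```

A learner that only consumes such corrected statistics, rather than raw labelled examples, never sees the noise directly; this is the sense in which SQ algorithms compile into noise-tolerant PAC algorithms.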

\[ \min_Y \left\| \begin{bmatrix} XX' & XY' \\ YX' & YY' \end{bmatrix} - \begin{bmatrix} A & B \\ B' & C \end{bmatrix} \right\|_F^2 \] where A = XX', and B, C are matrices of inner products calculated from the estimated distances. \par The vanishing of the gradient of the STRAIN is shown to be equivalent to a system of only six nonlinear equations in six unknowns for the inertial tensor associated with the solution Y. The entire solution space is characterized in terms of the geometry of the intersection curves between the unit sphere and certain variable ellipsoids. Upon deriving tight bilateral bounds on the moments of inertia of any possible solution, we construct a search procedure that reliably locates the global minimum. The effectiveness of this method is demonstrated on realistic simulated and chemical test problems. %X tr-04-97.ps.gz %R TR-05-97 %D 1997 %T Development of a Systematic Approach to Bottleneck Identification in UNIX systems %A Lori Park %X tr-05-97.ps.gz %R TR-06-97 %D 1997 %T A Revisitation of Kernel Synchronization Schemes %A Christopher Small %A Stephen Manley %X In an operating system kernel, critical sections of code must be protected from interruption. This is traditionally accomplished by masking the set of interrupts whose handlers interfere with the correct operation of the critical section. Because it can be expensive to communicate with an off-chip interrupt controller, more complex optimistic techniques for masking interrupts have been proposed. \par In this paper we present measurements of the behavior of the NetBSD 1.2 kernel, and use the measurements to explore the space of kernel synchronization schemes. We show that (a) most critical sections are very short, (b) very few are ever interrupted, (c) using the traditional synchronization technique, the synchronization cost is often higher than the time spent in the body of the critical section, and (d) under heavy load NetBSD 1.2 can spend 9% to 12% of its time in synchronization primitives.
\par The simplest scheme we examined, disabling all interrupts while in a critical section or interrupt handler, can lead to loss of data under heavy load. A more complex optimistic scheme functions correctly under the heavy workloads we tested and has very low overhead (at most 0.3%). Based on our measurements, we present a new model that offers the simplicity of the traditional scheme with the performance of the optimistic schemes. \par Given the relative CPU, memory, and device performance of today's hardware, the newer techniques we examined have a much lower synchronization cost than the traditional technique. Under heavy load, such as that incurred by a web server, a system using these newer techniques will have noticeably better performance. %X tr-06-97.ps.gz %R TR-07-97 %D 1997 %T An IRAM-Based Architecture for a Single-Chip ATM Switch %A A. Brown %A I. Papaefstathiou %A J. Simer %A D. Sobel %A J. Sutaria %A S. Wang %A T. Blackwell %A M. Smith %A W. Yang %X We have developed an architecture for an IRAM-based ATM switch that is implemented with merged DRAM and logic for a cost of about $100. The switch is based on a shared-buffer memory organization and is fully non-blocking. It can support a total aggregate throughput of 1.2 gigabytes per second, organized in any combination of up to thirty-two 155 Mb/sec, eight 622 Mb/sec, or four 1.2 Gb/sec full-duplex links. The switch can be fabricated on a single chip, and includes an internal 4 MB memory buffer capable of storing over 85,000 cells. When combined with external support circuitry, the switch is competitive with commercial offerings in its feature set, and significantly less expensive than existing solutions.
The switch is targeted to WAN infrastructure applications such as wide-area Internet access, data backbones, and digital telephony, where we feel untapped markets exist, but it is also usable for ATM-based LANs and even could be modified to penetrate the potentially lucrative Fast and Gigabit Ethernet markets. %X tr-07-97.ps.gz %R TR-08-97 %D 1997 %T Evaluation of Two Connectionist Approaches to Stack Representation %A Rebecca Hwa %X This study empirically compares two distributed connectionist learning models trained to represent an arbitrarily deep stack. One is Pollack's Recursive Auto-Associative Memory, a recurrent, back-propagating neural network that uses a hidden intermediate representation. The other is the Exponential Decay Model, a novel framework we propose here, that trains the network to explicitly model the stack as an exponentially decaying entity. We show that although the concept of a stack is learnable for both approaches, neither is able to deliver the arbitrary depth attribute. Ultimately, both suffer from the rapid rate of error propagation inherent in their recursive structures. %X tr-08-97.ps.gz %R TR-09-97 %D 1997 %T Learning Action Strategies for Planning Domains %A Roni Khardon %X This paper reports on experiments where techniques of supervised machine learning are applied to the problem of planning. The input to the learning algorithm is composed of a description of a planning domain, planning problems in this domain, and solutions for them. The output is an efficient algorithm - a strategy - for solving problems in that domain. We test the strategy on an independent set of planning problems from the same domain, so that success is measured by its ability to solve complete problems. A system, L2Act, has been developed in order to perform these experiments. 
\par We have experimented with the blocks world domain, and the logistics domain, using strategies in the form of a generalization of decision lists, where the rules on the list are existentially quantified first order expressions. The learning algorithm is a variant of Rivest`s (1987) algorithm, improved with several techniques that reduce its time complexity. As the experiments demonstrate, generalization is achieved so that large unseen problems can be solved by the learned strategies. The learned strategies are efficient and are shown to find solutions of high quality. We also discuss preliminary experiments with linear threshold algorithms for these problems. %X tr-09-97.ps.gz %R TR-10-97 %D 1997 %T L2Act User Manual %A Roni Khardon %X This note describes the system L2Act, the options it includes, and how to use it. We assume knowledge of the general ideas behind the system, as well as some details on the implementation described in TR-28-95 and TR-09-97. %X tr-10-97.ps.gz %R TR-11-97 %D 1997 %T Similarity-Based Approaches to Natural Language Processing %A Lillian Jane Lee %X Statistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the {\em sparse data problem}. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about similar events. 
This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity. \par Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing, represents (to our knowledge) the first application of soft clustering to problems in natural language processing. We use this method to cluster words drawn from 44 million words of Associated Press Newswire and 10 million words from Grolier's encyclopedia, and find that language models built from the clusters have substantial predictive power. Our algorithm also extends with no modification to other domains, such as document clustering. \par Our second approach is a nearest-neighbor approach: instead of calculating a centroid for each class, we in essence build a cluster around each word. We compare several such nearest-neighbor approaches on a word sense disambiguation task and find that as a whole, their performance is far superior to that of standard methods. In another set of experiments, we show that using estimation techniques based on the nearest-neighbor model enables us to achieve perplexity reductions of more than 20 percent over standard techniques in the prediction of low-frequency events, and statistically significant speech recognition error-rate reduction. %X tr-11-97.ps.gz %R TR-12-97 %D 1997 %T An Analysis of Issues Facing World Wide Web Servers %A Stephen Lee Manley %X The World Wide Web has captured the public's interest like no other computer application or tool. In response, businesses have attempted to capitalize on the Web's popularity. As a result, propaganda, assumption, and unfounded theories have taken the place of facts, scientific analysis, and well-reasoned theories.
As with all things, the Web's popularity comes with a price: for the first time, the computer industry must satisfy exponentially increasing demand. As the World Wide Web becomes the "World Wide Wait" and the Internet changes from the "Information Superhighway" to a "Giant Traffic Jam," the public demands improvements in Web performance. \par The lack of cogent scientific analysis prevents true improvement in Web conditions. Nobody knows the source of the bottlenecks. Some assert that the server must be the problem. Others blame Internet congestion. Still others place the blame on modems or slow Local Area Networks. The Web's massive size and growth have made research difficult, but those same factors make such work indispensable. \par This thesis examines issues facing the Web by focusing on traffic patterns on a variety of servers. The thesis presents a method of categorizing different Web site growth patterns. It then disproves the theory that CGI has become an important and varied tool on most Web sites. Most importantly, however, the thesis focuses on the source of latency on the Web. An in-depth examination of the data leads to the conclusion that the server cannot be a primary source of latency on the World Wide Web. \par The thesis then details the creation of a new realistic, self-configuring, scaling Web server benchmark. By using a site's Web server logs, the benchmark can create a model of the site's traffic. The model can be reduced by a series of abstractions, and scaled to predict future behavior. Finally, the thesis shows the benchmark models realistic Web server traffic, and can serve as a tool for scientific analysis of developments on the Web, and their effects on the server. %X tr-12-97.ps.gz %R TR-13-97 %D 1997 %T Data Parallel Performance Optimizations Using Array Aliasing %A Y. Charlie Hu %A S.
Lennart Johnsson %X The array aliasing mechanism provided in the Connection Machine Fortran (CMF) language and run--time system provides a unique way of identifying the memory address spaces local to processors within the global address space of distributed memory architectures, while staying in the data parallel programming paradigm. We show how the array aliasing feature can be used effectively in optimizing communication and computation performance. The constructs we present occur frequently in many scientific and engineering applications, and include various forms of aggregation and array reshaping through array aliasing. The effectiveness of the optimization techniques is demonstrated on an implementation of Anderson's hierarchical $O(N)$ $N$--body method. %X tr-13-97.ps.gz %R TR-14-97 %D 1997 %T Efficient Data Parallel Implementations of Highly Irregular Problems %A Yu Hu %X This dissertation presents optimization techniques for efficient data parallel formulation/implementation of highly irregular problems, and applies the techniques to $O(N)$ hierarchical \Nbody methods for large--scale \Nbody simulations. It demonstrates that highly irregular scientific and engineering problems such as nonadaptive and adaptive $O(N)$ hierarchical \Nbody methods can be efficiently implemented in high--level data parallel languages such as High Performance Fortran (HPF) on scalable parallel architectures. It also presents an empirical study of the accuracy--cost tradeoffs of $O(N)$ hierarchical \Nbody methods. \par This dissertation first develops optimization techniques for efficient data parallel implementation of irregular problems, focusing on minimizing the data movement through careful management of the data distribution and the data references, both between the memories of different nodes, and within the memory hierarchy of each node.
For hierarchical \Nbody methods, our optimizations for improving arithmetic efficiency include recognizing dominating computations as matrix--vector multiplications and aggregating them into multiple--instance matrix--matrix multiplications. Experimental results with an implementation in Connection Machine Fortran of Anderson's hierarchical \Nbody method demonstrate that performance competitive with that of the best message--passing implementations of the same class of methods can be achieved. \par The dissertation also presents a general data parallel formulation for highly irregular applications, and applies the formulation to an adaptive hierarchical \Nbody method with highly nonuniform particle distributions. The formulation consists of (1) a method for linearizing irregular data structures, (2) a data parallel implementation (in HPF) of graph partitioning algorithms applied to the linearized data structure, and (3) techniques for expressing irregular communications and nonuniform computations associated with the elements of linearized data structures. Experimental results demonstrate that efficient data parallel (HPF) implementations of highly nonuniform problems are feasible with proper language/compiler/runtime support. Our data parallel $N$--body code provides a much needed ``benchmark'' code for evaluating and improving HPF compilers. \par This thesis also develops the first data parallel (HPF) implementation of the geometric partitioning algorithm due to Miller, Teng, Thurston and Vavasis -- one of the only two provably good partitioning schemes. Our data parallel formulation makes extensive use of segmented prefix sums and parallel selections, and provides a data parallel procedure for geometric sampling. Experiments on partitioning particles for load--balance and data interactions as required in hierarchical \Nbody algorithms show that the geometric partitioning algorithm has an efficient data parallel formulation.
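The segmented prefix sums that the formulation above relies on can be sketched sequentially; this is an illustrative stand-in (names and the sequential loop are ours), whereas the dissertation's HPF versions are data parallel:

```python
def segmented_prefix_sum(values, flags):
    """Inclusive segmented prefix sum: flags[i] is True where a new
    segment starts, and the running sum restarts at each segment
    boundary. A basic building block of data parallel formulations
    of irregular problems (sequential sketch for illustration)."""
    out = []
    running = 0
    for v, start in zip(values, flags):
        running = v if start else running + v
        out.append(running)
    return out

# Two segments: [1, 2] and [3, 4, 5]; the sum restarts at index 2.
print(segmented_prefix_sum([1, 2, 3, 4, 5], [True, False, True, False, False]))
# [1, 3, 3, 7, 12]
```

In a data parallel setting the same operation is expressed as a parallel scan over (value, flag) pairs rather than a loop.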
\par Finally, this thesis studies the accuracy--cost tradeoffs of $O(N)$ hierarchical \Nbody methods using our implementation of nonadaptive Anderson's method. The various parameters, which control the degree of approximation of the computational elements and the separateness of the interacting computational elements, govern both the arithmetic complexity and the accuracy of the methods. A scheme for choosing optimal parameters that give the best running time for a prescribed error requirement is developed. Using this scheme, we find that for a prescribed error, a near--field containing only nearest-neighbor boxes, combined with the hierarchy depth that minimizes the number of arithmetic operations, also minimizes the total running time. %X tr-14-97.ps.gz %R TR-15-97 %D 1997 %T The Computational Processing of Intonational Prominence: A Functional Prosody Perspective %A Christine Hisayo Nakatani %X tr-15-97.ps.gz %R TR-16-97 %D 1997 %T Does Systems Research Measure Up? %A Christopher Small %A Narendra Ghosh %A Hany Saleeb %A Margo Seltzer %A Keith Smith %X We surveyed more than two hundred systems research papers published in the last six years, and found that, in experiment after experiment, systems researchers measure the same things, but in the majority of cases the reported results are not reproducible, comparable, or statistically rigorous. In this paper we present data describing the state of systems experimentation and suggest guidelines for structuring commonly run experiments, so that results from work by different researchers can be compared more easily. We conclude with recommendations on how to improve the rigor of published computer systems research.
%X tr-16-97.ps.gz %R TR-17-97 %D 1997 %T A Self-Scaling and Self-Configuring Benchmark for Web Servers %A Stephen Manley, Network Appliance %A Michael Courage, Microsoft Corporation %A Margo Seltzer, Harvard University %X World Wide Web clients and servers have become some of the most important applications in our computing base today, and as such, we need realistic and meaningful ways of capturing their performance. Current server benchmarks do not capture the wide variation that we see in servers and are not accurate in their characterization of web traffic. In this paper, we present a self-configuring, scalable benchmark that generates a server benchmark load based on actual server loads. In contrast to other web benchmarks, our benchmark characterizes request latency instead of focusing exclusively on throughput-sensitive metrics. We present our new benchmark, hbench:Web, and demonstrate how it accurately captures the load observed by an actual server. We then go on to show how it can be used to assess how continued growth or changes in the workload will affect future performance. Using existing log histories, we show that these predictions are sufficiently realistic to provide insight into tomorrow's web performance. %X tr-17-97.ps.gz %R TR-18-97 %D 1997 %T Issues in Extensible Operating Systems %A Margo I. Seltzer %A Yasuhiro Endo %A Christopher Small %A Keith A. Smith %X Operating systems research has traditionally consisted of adding functionality to the operating system or inventing and evaluating new methods for performing functions. Regardless of the research goal, the single constant has been that the size and complexity of operating systems increase over time. As a result, operating systems are usually the single most complex piece of software in a computer system, containing hundreds of thousands, if not millions, of lines of code.
Today's operating system research is directed at finding new ways to structure the operating system in order to increase its flexibility, allowing it to adapt to changes in the application set it must support. This paper discusses the issues involved in designing such extensible systems and the array of choices facing the operating system designer. We present a framework for describing extensible operating systems and then relate current operating systems to this framework. %X tr-18-97.ps.gz %R TR-19-97 %D 1997 %T Projection Learning %A Leslie G. Valiant %X A method of combining learning algorithms is described that preserves attribute efficiency. It yields learning algorithms that require a number of examples that is polynomial in the number of relevant variables and logarithmic in the number of irrelevant ones. The algorithms are simple to implement and realizable on networks with a number of nodes linear in the total number of variables. They can be viewed as strict generalizations of Littlestone's Winnow algorithm, and, therefore, appropriate to domains having very large numbers of attributes, but where nonlinear hypotheses are sought. %X tr-19-97.ps.gz %R TR-20-97 %D 1997 %T The Mug-Shot Search Problem %A Ellie Baker %A Margo Seltzer %X Mug-shot search is the classic example of the general problem of searching a large facial image database when starting out with only a mental image of the sought-after face. We have implemented a prototype content-based image retrieval system that integrates composite face creation methods with a face-recognition technique (Eigenfaces) so that a user can both create faces and search for them automatically in a database. These two functions are fully integrated so that interim created composites may be used to search the data and interim search results may, likewise, be used to modify an evolving composite.
\par Although the Eigenface method has been studied extensively for its ability to perform face identification tasks (in which the input to the system is an on-line facial image to identify), little research has been done to determine how effective it is as applied to the mug shot search problem (in which there is no on-line input image at the outset). With our prototype system, we have conducted a pilot user study that looks at the usefulness of eigenfaces as applied to this problem. The study shows that the eigenface method, though helpful, is an imperfect model of human perception of similarity between faces. Using a novel evaluation methodology, we have made progress at identifying specific search strategies that, given an imperfect correlation between the system and human similarity metrics, use whatever correlation does exist to the best advantage. We have also shown that the use of facial composites as query images is advantageous compared to restricting users to database images for their queries. %X tr-20-97.ps.gz %R TR-21-97 %D 1997 %T Quality and Speed in Linear-scan Register Allocation %A Omri Traub %A Glenn Holloway %A Michael Smith %X A linear-scan algorithm directs the global allocation of register candidates to registers based on a simple linear sweep over the program being compiled. This approach to register allocation makes sense for systems, such as those for dynamic compilation, where compilation speed is important. In contrast, most commercial and research optimizing compilers rely on a graph-coloring approach to global register allocation. In this paper, we compare the performance of a linear-scan method against a modern graph-coloring method. We implement both register allocators within the Machine SUIF extension of the Stanford SUIF compiler system. Experimental results show that linear scan is much faster than coloring on benchmarks with large numbers of register candidates.
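The linear sweep that TR-21-97 evaluates can be sketched over live intervals. This is an illustrative simplification, not the paper's algorithm: it always spills the newly arriving interval, whereas practical linear-scan allocators spill the active interval that ends furthest away:

```python
def linear_scan(intervals, num_regs):
    """Toy linear-scan register allocation over live intervals given
    as (start, end, name) tuples. One pass in order of increasing
    start point; when no register is free, the new interval is
    spilled (a simplification made for brevity)."""
    intervals = sorted(intervals)        # by start point
    active = []                          # (end, name, reg) currently live
    free = list(range(num_regs))
    assignment, spills = {}, []
    for start, end, name in intervals:
        # Expire intervals that ended before this one starts.
        for item in [a for a in active if a[0] <= start]:
            active.remove(item)
            free.append(item[2])
        if free:
            reg = free.pop()
            active.append((end, name, reg))
            assignment[name] = reg
        else:
            spills.append(name)          # no register left: spill
    return assignment, spills

# Three overlapping candidates, two registers: one must spill.
alloc, spilled = linear_scan([(0, 4, "a"), (1, 5, "b"), (2, 3, "c")], 2)
print(alloc, spilled)
```

The single pass over sorted intervals is what keeps the allocator linear-time, in contrast to building and coloring an interference graph.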
We also describe improvements to the linear-scan approach that do not change its linear character, but allow it to produce code of a quality near to that produced by graph coloring. %X tr-21-97.ps.gz %R TR-22-97 %D 1997 %T The Evolution of SharedPlans %A Barbara J. Grosz %A Sarit Kraus %X tr-22-97.ps.gz %R TR-23-97 %D 1997 %T Computer-Human Collaboration in the Design of Graphics %A Kathleen Ryall %X Delineating the roles of the user and the computer in a system is a central task in user interface design. As interactive applications become more complex, it is increasingly difficult to design interface methods that deliver the full power of an application to users, while enabling them to learn and use a system's interface effectively. The challenge is finding a balance between user intervention and computer control within the computer-user interface. \par In this thesis, we propose a new paradigm for determining this division of labor, which attempts to exploit the strengths of each collaborator, the human user and the computer system. This collaborative framework encourages the development of semi-automatic systems, through which users can explore a large number of candidate solutions, while evaluating and comparing various alternatives. Under the collaborative framework, the problem to be solved is framed as an optimization problem, which is then decomposed into local and global portions. The user is responsible for global aspects of the problem: placing the computer into different areas of the search space, and determining when an acceptable solution has been reached. The computer works at the local level, computing the local minima, displaying results to the user, and providing simple interface mechanisms to facilitate the interaction.
Systems employing this approach make use of task-specific information to leverage the actions of users, handling fine-grained details while leaving the high-level aspects, specified through gross interface gestures, to the user. \par We present applications of our collaborative paradigm to the design and implementation of semi-automatic systems for three tasks from the domain of graphic design: network diagram layout, parameter specification for computer graphics algorithms and floor plan segmentation. The collaborative paradigm we propose is well-suited for this domain. Systems designed under our framework support an iterative design process, an integral component of graphic design. Furthermore, the collaborative framework for computer-user interface design exploits people's expertise at incorporating aesthetic criteria and semantic information into finding an acceptable solution for a graphic design task, while harnessing a computer's computational power, to enable users to explore a large space of candidate solutions. %X tr-23-97.ps.gz %R tr-24-97 %D 1997 %T Fast Parallel Orthogonal Transforms %A Nadia Shalaby %X Orthogonal transforms are ubiquitous in engineering and applied science applications. Such applications place stringent requirements on a computer system, either in terms of time, which translates into processing power, or problem size, which translates into memory size, or most often both. To achieve such high performance, these applications have been migrating to the realm of parallel computing, offering significantly more processing power and memory. This trend mandates the development of fast parallel algorithms for orthogonal transforms that are versatile across many parallel platforms; such algorithms are the subject of this thesis.
\par We present a unified approach in seeking fast parallel orthogonal transforms by establishing a theoretical formulation for each algorithm, presenting a simple specification to facilitate implementation and analysis, and by evaluating performance via a set of predefined arithmetic, memory, communication and load--imbalance complexity metrics. We adopt a bottom-up approach by constructing the thesis in the form of progressively larger modular blocks, each relying on a lower level block for its formulation, algorithmic specification and performance analysis. \par Since communication is the bottleneck for parallel computing, we first establish an algebraic framework for stable parallel permutations, the predominant communication requirement for parallel orthogonal transforms. We present a taxonomy categorizing stable permutations into classes of index--digit, linear, translation, affine and polynomial permutations. For each class, we demonstrate its general behavioral properties and prove permutation locality and other properties for particular examples. \par These results are then applied in formulating, specifying and evaluating performance of the direct and load--balanced parallel algorithms for the fast Fourier transform (FFT); we prove the latter to be optimal. We demonstrate the versatility of our complexity metrics by substituting them into the PRAM, LPRAM, BSP and LogP computational models. Consequently, we employ the load--balanced FFT in deriving a novel polynomial--based discrete cosine transform (DCT) and demonstrate its performance advantage over the classical DCT. Finally, the polynomial DCT is used as a building block for a novel parallel fast Legendre transform (FLT), a parallelization of the Driscoll--Healey $O(N \log^2 N)$ method. We show that the algorithm is hierarchically load--balanced by composing its specification and complexity from its modular blocks.
%X tr-24-97.ps.gz %R TR-01-98 %D 1998 %T A Seed-Growth Heuristic for Graph Bisection %A Joe Marks %A Wheeler Ruml %A Stuart M. Shieber %A J. Thomas Ngo %X We present a new heuristic algorithm for graph bisection, based on an implicit notion of clustering. We describe how the heuristic can be combined with stochastic search procedures and a post-process application of the Kernighan-Lin algorithm. In a series of time-equated comparisons with large-sample runs of pure Kernighan-Lin, the new algorithm demonstrates significant superiority in terms of the best bisections found. %X tr-01-98.ps.gz %R TR-02-98 %D 1998 %T Activity Graph %A Michael Karr %X This paper discusses activity graphs, the mathematical formalism underlying the Activity Coordination System, a process-centered system for collaborative work. The activity graph model was developed for the purpose of describing, tracking, and guiding distributed, communication-intensive activities. Its graph-based semantics encompasses the more familiar Petri nets, but has several novel properties. For example, it is possible to impose multiple hierarchies on the same graph, so that the hierarchy with which an activity is described does not have to be the one with which it is viewed. The paper also discusses very briefly some aspects of the system implementation. %X tr-02-98.ps.gz %R TR-03-98 %D 1998 %T An Activity Coordination System %A Michael Karr %A Thomas E. Cheatham, Jr. %X tr-03-98.ps.gz %R TR-04-98 %D 1998 %T A Neuroidal Architecture for Cognitive Computation %A Leslie G. Valiant %X An architecture is described for designing systems that acquire and manipulate large amounts of unsystematized, or so-called commonsense, knowledge. Its aim is to exploit to the full those aspects of computational learning that are known to offer powerful solutions in the acquisition and maintenance of robust knowledge bases.
The architecture makes explicit the requirements on the basic computational tasks that are to be performed and is designed to make these computationally tractable even for very large databases. The main claims are that (i) the basic learning tasks are tractable and (ii) tractable learning offers viable approaches to a range of issues that have been previously identified as problematic for artificial intelligence systems that are entirely programmed. In particular, attribute efficiency holds a central place in the definition of the learning tasks, as does the capability to handle relational information efficiently. Among the issues that learning offers to resolve are robustness to inconsistencies, robustness to incomplete information, and resolution among alternatives. %X tr-04-98.ps.gz %R TR-05-98 %D 1998 %T E-L Rational %A Glenn Holloway %A Steve Squires %X REPORT WITHDRAWN %R TR-06-98 %D 1998 %T An Empirical Evaluation of Probabilistic Lexicalized Tree Insertion Grammar %A Rebecca Hwa %X We present an empirical study of the applicability of Probabilistic Lexicalized Tree Insertion Grammars (PLTIG), a lexicalized counterpart to Probabilistic Context-Free Grammars (PCFG), to problems in stochastic natural-language processing. Comparing the performance of PLTIGs with non-hierarchical $N$-gram models and PCFGs, we show that PLTIG combines the best aspects of both, with language modeling capability comparable to $N$-grams, and improved parsing performance over its non-lexicalized counterpart. Furthermore, training of PLTIGs displays faster convergence than PCFGs. %X tr-06-98.ps.gz %R TR-07-98 %D 1998 %T Parsing Inside-Out %A Joshua Goodman %X The inside-outside probabilities are typically used for reestimating Probabilistic Context Free Grammars (PCFGs), just as the forward-backward probabilities are typically used for reestimating HMMs.
In this thesis, I show several novel uses, including improving parser accuracy by matching parsing algorithms to evaluation criteria; speeding up DOP parsing by a factor of 500; and making PCFG thresholding 30 times faster at a given accuracy level. I also give an elegant, state-of-the-art grammar formalism, which can be used to compute inside-outside probabilities; and a parser description formalism, which makes it easy to derive inside-outside formulas and many others. %X tr-07-98.ps.gz %R TR-08-98 - This report has been withdrawn %D 1998 %R TR-09-98 %D 1998 %T A Statistical Analysis of User-Specific Profiles %A Zheng Wang %A Norm Rubin %X This technical report examines common assumptions about computer users in profile-based optimization. We study execution profiles of interactive applications on Windows NT to understand how different users use the same program. The profiles were generated by the DIGITAL FX!32 emulator/binary translator system, which automatically runs the x86 version of Windows NT programs on NT/Alpha computers. Our statistical analysis indicates that people use the benchmark programs in different ways. This technical report is a supplement to the paper "Evaluating the Importance of User-Specific Profiling," to appear in "Proceedings of the 2nd USENIX Windows NT Symposium," USENIX Association, August 1998. %X tr-09-98.ps.gz %R TR-10-98 %D 1998 %T An Empirical Study of Smoothing Techniques for Language Modeling %A Stanley F. Chen %A Joshua Goodman %X We present a tutorial introduction to $n$-gram models for language modeling and survey the most widely-used smoothing algorithms for such models. We then present an extensive empirical comparison of several of these smoothing techniques, including those described by Jelinek and Mercer (1980), Katz (1987), Bell, Cleary, and Witten (1990), Ney, Essen, and Kneser (1994), and Kneser and Ney (1995).
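One of the simplest techniques in the family surveyed in TR-10-98 above, Jelinek-Mercer smoothing, linearly interpolates the maximum-likelihood bigram estimate with the unigram distribution. A minimal sketch follows; the tiny corpus and the interpolation weight are invented for illustration (in practice the weight is tuned on held-out data):

```python
from collections import Counter

def jelinek_mercer(bigrams, unigrams, total, lam=0.7):
    """Return a bigram probability estimator smoothed by linear
    interpolation with the unigram distribution (Jelinek-Mercer)."""
    def prob(w_prev, w):
        ml = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
        return lam * ml + (1 - lam) * unigrams[w] / total
    return prob

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
p = jelinek_mercer(bigrams, unigrams, len(corpus))

# The unseen bigram ("mat", "ran") still receives nonzero probability,
# which is the whole point of smoothing.
print(p("the", "cat"), p("mat", "ran"))
```

More sophisticated methods in the survey, such as Katz backoff and Kneser-Ney, differ mainly in how they discount seen counts and redistribute mass to unseen events.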
We investigate how factors such as training data size, training corpus (e.g., Brown versus Wall Street Journal), count cutoffs, and n-gram order (bigram versus trigram) affect the relative performance of these methods, which is measured through the cross-entropy of test data. Our results show that previous comparisons have not been complete enough to fully characterize smoothing algorithm performance. We introduce methodologies for analyzing smoothing algorithm efficacy in detail, and using these techniques we motivate a novel variation of Kneser-Ney smoothing that consistently outperforms all other algorithms evaluated. Finally, results showing that improved language model smoothing leads to improved speech recognition performance are presented. %X tr-10-98.ps.gz %R TR-11-98 %D 1998 %T Expectation Value of the Lowest of a Set of Randomly Selected Integers %A Adolph Baker %A Ellen Baker %X Consider the set of positive integers 0, 1, 2, ..., D. If we pick N of them at random, where N < (D+1), what is the expectation (or average value) of the lowest-valued of the N picks? We briefly describe the image database search question that gave rise to this problem, and present a proof that the answer is (D-N+1)/(N+1). %X tr-11-98.ps.gz %R TR-12-98 %D 1998 %T CS50 1997 Quantitative Study %A Dan Ellard %X A discussion and analysis of a quantitative study of CS50, Harvard's introductory Computer Science course. It describes a system for passively monitoring and gathering data about the use of various programming utilities by students, gives an overview of the data gathered with this system, and an analysis of the relationship between student work habits, other factors, and success in CS50. Finally, it discusses ideas for how this research could be extended and continued. \par This research was performed as part of Dan Ellard's qualifying examination. %X tr-12-98.ps.gz %R TR-13-98 %D 1998 %T The ANT Architecture -- An Architecture for CS1 %A Daniel J. Ellard %A Penelope A.
Ellard %A James M. Megquier %A J. Bradley Chen %A Margo I. Seltzer %X An overview of the pedagogical and philosophical motivations for teaching machine architecture in CS1, and a description of the ANT architecture, which was specifically designed for use in our introductory computer science course. %X tr-13-98.ps.gz %R TR-14-98 %D 1998 %T The ANT Architecture -- An Architecture For CS1 %A Daniel J. Ellard %A Penelope A. Ellard %A James M. Megquier %A J. Bradley Chen %X A description of the ANT architecture, how we use ANT in our CS1 curriculum, and the pedagogical issues that motivated the creation and design of the ANT architecture. %X tr-14-98.ps.gz %R TR-15-98 %D 1998 %T Data Representation and Assembly Language Programming: The ANT-97 Architecture %A Daniel J. Ellard %A Penelope A. Ellard %X A tutorial for data representation (principally two's complement notation and arithmetic, and ASCII codes) and assembly language programming using the ANT architecture. This tutorial has been used in Harvard's CS1 course (1996-1998). %X tr-15-98.ps.gz %R TR-16-98 %D 1998 %T The Mug-Shot Search Problem: A Study of the Eigenface Metric, Search Strategies, and Interfaces in a System for Searching Facial Image Data %A Ellen Jill Baker %X This thesis presents an investigation of methods for conducting an efficient look-up in a pictorial "phonebook" (i.e., a facial image database). Although research on efficient "mug-shot search" is under way, little has yet been done to evaluate the effectiveness of various proposed techniques, and much work remains before systems as practical or ubiquitous as phonebooks are attainable. The thesis describes a prototype system based on the idea of combining a composite face creation method with a face-recognition technique, so that a user may create a facial image and then automatically locate other similar-looking faces in the database.
Several methods for evaluating such a system are presented as well as the results and analysis of a user study employing the methods. \par Three basic system components are considered and evaluated: the metric for determining which faces are most similar in appearance to a given "query" face, the interface for producing the query face, and the search strategy. The data demonstrate that the Eigenface metric is a useful (though imperfect) model of human perception of similarity between faces. The data also show how the lack of agreement among people about which faces are most similar to a query limits what can be reasonably expected from any metric. Via simulation, it is demonstrated that, if indeed there were a single human metric for assessing facial similarity, and if the Eigenface metric correlated perfectly with this human metric, then simple interactive hill-climbing in the space of the database images would be an excellent search strategy, capable of reducing the average number of image inspections required in a search to about 2% of the database. But this superiority of hill-climbing in principle is not sustained in practice, given the observed level of correlation between the Eigenface similarity metric and the "human" one. The average number of image inspections required for the hill-climbing strategy was, in fact, closer to 35% of the database. While this represents an improvement over the 50% required on average for a simple sequential search of the data, it is still insufficient for practical use. However, given the actual performance of the Eigenface metric, the study data show that a non-iterative strategy of constructing a single query image that is a composite of selected features from 100 random database faces is a better approach, reducing the average number of image inspections to about 20% of the database.
These and other examples demonstrate and quantify the benefits of an interface in which the Eigenface metric is combined with a composite creation system. %X tr-16-98.ps.gz %R TR-01-99 %D 1999 %T Silhouette Mapping %A Xianfeng Gu %A Steven Gortler %A Hugues Hoppe %A Leonard McMillan %A Benedict J. Brown %A Abraham D. Stone %X Recent image-based rendering techniques have shown success in approximating detailed models using sampled images over coarser meshes. One limitation of these techniques is that the coarseness of the geometric mesh is apparent in the rough polygonal silhouette of the rendering. In this paper, we present a scheme for accurately capturing the external silhouette of a model in order to clip the approximate geometry. \par Given a detailed model, silhouettes sampled from a discrete set of viewpoints about the object are collected into a silhouette map. The silhouette from an arbitrary viewpoint is then computed as the interpolation from three nearby viewpoints in the silhouette map. Pairwise silhouette interpolation is based on a visual hull approximation in the epipolar plane. The silhouette map itself is adaptively simplified by removing views whose silhouettes are accurately predicted by interpolation of their neighbors. The model geometry is approximated by a progressive hull construction, and is rendered using projected texture maps. The 3D rendering is clipped to the interpolated silhouette using stencil planes. %X tr-01-99.ps.gz %R TR-02-99 %D 1999 %T Speculative Pruning for Boolean Satisfiability %A Wheeler Ruml %A Adam Ginsburg %A Stuart Shieber %X Much recent work on boolean satisfiability has focused on incomplete algorithms that sacrifice accuracy for improved running time. Statistical predictors of satisfiability do not return actual satisfying assignments, but at least two have been developed that run in linear time. Search algorithms allow increased accuracy with additional running time, and can return satisfying assignments.
The efficient search algorithms that have been proposed are based on iteratively improving a random assignment, in effect searching a graph of degree equal to the number of variables. In this paper, we examine an incomplete algorithm based on searching a standard binary tree, in which statistical predictors are used to speculatively prune the tree in constant time. Experimental evaluation on hard random instances shows it to be the first practical incomplete algorithm based on tree search, surpassing even graph-based methods on smaller instances. %X tr-02-99.ps.gz %R TR-03-99 %D 1999 %T Self-Monitoring in VINO %A Hany S. Saleeb %X Computer system performance is a measure of how well the operating system shares hardware and software resources among the various applications that are running on it. The goal of performance monitoring in the VINO extensible operating system is to make recommendations for improving application performance. This is accomplished by collecting system data through a monitoring agent, automatically identifying conditions causing performance degradation, and presenting evidence to support its conclusions. Once the operating system is well monitored, it may be able to tune itself to improve system performance. \par Within the framework of VINO, I describe a system that monitors itself and gathers information about its performance. I show that system monitoring is advantageous for two reasons. First, through the use of two user applications, it is capable of warning designers of application bottlenecks and system degradation. Second, the operating system can dynamically self-adapt its own kernel behavior and policies after monitoring access patterns. Thus, the monitoring system aids the application designer in defining performance limitations and adapts kernel policies to improve overall system performance. 
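The idea in TR-02-99 above, a binary assignment tree pruned speculatively by a cheap statistical predictor, can be sketched as follows. The clause-counting predictor, the threshold, and all names here are our invention, not the paper's algorithm, and like any incomplete method this pruning may discard satisfying assignments:

```python
def clause_sat(clause, assignment):
    # A clause is a tuple of literals; literal v > 0 means variable v
    # is True. `assignment` holds bools for variables 1..len(assignment).
    return any((lit > 0) == assignment[abs(lit) - 1]
               for lit in clause if abs(lit) <= len(assignment))

def speculative_search(clauses, num_vars, threshold=0):
    """DFS over the binary assignment tree: try the branch that
    already satisfies more clauses first, and speculatively prune a
    sibling whose score trails by more than `threshold`."""
    def score(partial):
        return sum(clause_sat(c, partial) for c in clauses)
    def search(partial):
        if len(partial) == num_vars:
            return partial if all(clause_sat(c, partial) for c in clauses) else None
        kids = sorted((partial + [v] for v in (True, False)),
                      key=score, reverse=True)
        for i, kid in enumerate(kids):
            if i == 1 and score(kids[0]) - score(kid) > threshold:
                continue  # speculative prune: skip the weaker sibling
            found = search(kid)
            if found:
                return found
        return None
    return search([])

# (x1 or not x2) and (x2 or x3) and (not x1 or x3)
clauses = [(1, -2), (2, 3), (-1, 3)]
print(speculative_search(clauses, 3))  # [True, True, True]
```

A complete DPLL-style search would backtrack into both children unconditionally; the pruning trades that guarantee for a smaller tree.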
%X tr-03-99.ps.gz %R TR-04-99 %D 1999 %T A Collaborative Approach to Newspaper Layout %A Benjamin Lubin %X tr-04-99.ps.gz %R TR-05-99 %D 1999 %T AnGraf: Creating Custom Animated Data Graphics %A Daniel Dias %X tr-05-99.ps.gz %R TR-06-99 %D 1999 %T Creating Socially Conscious Agents: Decision-Making in the Context of Group Commitments %A Alyssa Glass %X With growing opportunities for individually motivated agents to work collaboratively to satisfy shared goals, it becomes increasingly important to design agents that can make intelligent decisions in the context of commitments to group activities. In particular, agents need to be able to reconcile their intentions to do group-related actions with other, conflicting actions. In this thesis, I present the framework for the SPIRE experimental system that allows the process of intention reconciliation in team contexts to be simulated and studied. I define a measure of social consciousness and show how it can be incorporated into the SPIRE system. Using SPIRE, I then investigate the effect of infinite and limited time horizons on agents with varying levels of social consciousness, as well as the resulting effect on the utility of the group as a whole. Using these experiments as a basis for theoretic conclusions, I suggest preliminary principles for designers of collaborative agents. %X tr-06-99.ps.gz %R TR-07-99 %D 1999 %T Logging versus Soft Updates: Asynchronous Meta-data Protection in File Systems %A Margo Seltzer %A Gregory Ganger %A M. Kirk McKusick %A Keith A. Smith %A Craig Soules %A Christopher Stein %X The UNIX Fast File System (FFS) is probably the most widely-used file system for performance comparisons. However, such comparisons frequently overlook many of the performance enhancements that have been added over the past decade. In this paper, we explore the two most commonly used approaches for improving the performance of meta-data operations and recovery: logging and Soft Updates.
\par The commercial sector has moved en masse to logging file systems, as evidenced by their presence on nearly every server platform available today: Solaris, AIX, Digital UNIX, HP-UX, Irix and Windows NT. On all but Solaris, the default file system uses logging. In the meantime, Soft Updates holds the promise of providing stronger reliability guarantees than logging, with faster recovery and superior performance in certain boundary cases. \par In this paper, we explore the benefits of both Soft Updates and logging, comparing their behavior on both microbenchmarks and workload-based macrobenchmarks. We find that logging alone is not sufficient to "solve" the meta-data update problem. If synchronous semantics are required (i.e., meta-data operations are durable once the system call returns), then the logging systems cannot realize their full potential. Only when this synchronicity requirement is relaxed can logging systems approach the performance of systems like Soft Updates. Our asynchronous logging and Soft Updates systems perform comparably in most cases. While Soft Updates excels in some meta-data intensive microbenchmarks, it outperforms logging on only two of the four workloads we examined and performs less well on one. %X tr-07-99.ps.gz %R TR-08-99 %D 1999 %T The Asymptotics of Selecting the Shortest of Two, Improved %A Michael Mitzenmacher %A Berthold Voecking %X We investigate variations of a novel, recently proposed load balancing scheme based on small amounts of choice. The static (hashing) setting is modeled as a balls-and-bins process. The balls are sequentially placed into bins, with each ball selecting $d$ bins randomly and going to the bin with the fewest balls. A similar dynamic setting is modeled as a scenario where tasks arrive as a Poisson process at a bank of FIFO servers and queue at one for service. Tasks probe a small random sample of servers in the bank and queue at the server with the fewest tasks.
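The balls-and-bins placement rule just described (each ball probes $d$ random bins and joins the least loaded) is easy to simulate. The following sketch, with illustrative parameters not taken from the paper, shows the sharp drop in maximum load when moving from $d=1$ to $d=2$:

```python
import random

def max_load(n_balls, n_bins, d, rng):
    """Sequentially place n_balls; each ball samples d bins uniformly at
    random and goes to the currently least-loaded of its d choices.
    Returns the maximum load over all bins."""
    loads = [0] * n_bins
    for _ in range(n_balls):
        # pick the least-loaded of d uniformly random candidate bins
        best = min((rng.randrange(n_bins) for _ in range(d)),
                   key=lambda b: loads[b])
        loads[best] += 1
    return max(loads)

n = 10_000
print("d=1 max load:", max_load(n, n, 1, random.Random(0)))  # Theta(log n / log log n)
print("d=2 max load:", max_load(n, n, 2, random.Random(0)))  # Theta(log log n)
```

Even at this modest size, the two-choice maximum load is well below the one-choice maximum, which is the effect the paper's fluid-limit analysis quantifies.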
\par Recently it has been shown that breaking ties in a fixed, asymmetric fashion actually improves performance, whereas in all previous analyses, ties were broken randomly. We demonstrate the nature of this improvement using fluid limit models, suggest further improvements, and verify and quantify the improvement through simulations. %X tr-08-99.ps.gz %R TR-09-99 %D 1999 %T Improving Interactive System Performance using TIPME %A Yasuhiro Endo %A Margo Seltzer %X On the vast majority of today's computers, the dominant form of computation is GUI-based user interaction. In such an environment, the user's perception is the final arbiter of performance. Human-factors research shows that a user's perception of performance is affected by unexpectedly long delays. However, most performance-tuning techniques currently rely on throughput-sensitive benchmarks. While these techniques improve the average performance of the system, they do little to detect or eliminate response-time variabilities -- in particular, unexpectedly long delays. \par We introduce a measurement methodology that improves user-perceived performance by helping us to identify and eliminate the causes of the unexpectedly long response times that users find unacceptable. We describe TIPME (The Interactive Performance Monitoring Environment), a collection of measurement tools that implements this methodology, and we present two case studies that demonstrate its effectiveness. Each of the performance problems we identify drastically affects variability in response time in a mature system, demonstrating that current tuning techniques do not address this class of performance problems. %X tr-09-99.ps.gz %R TR-10-99 %D 1999 %T Seed-Growth Heuristics for Graph Bisection %A Wheeler Ruml %A Joseph Marks %A Stuart M. Shieber %A J. Thomas Ngo %X We investigate a family of algorithms for graph bisection that are based on a simple local connectivity heuristic, which we call seed-growth.
We show how the heuristic can be combined with stochastic search procedures and a postprocessing application of the Kernighan-Lin algorithm. In a series of time-equated comparisons against large-sample runs of pure Kernighan-Lin, the new algorithms find bisections of the same or superior quality. Their performance is particularly good on structured graphs representing important industrial applications. An appendix provides further favorable comparisons to other published results. Our experimental methodology and extensive empirical results provide a solid foundation for further empirical investigation of graph-bisection algorithms. %X tr-10-99.ps.gz %R TR-11-99 %D 1999 %T Socially Conscious Decision-Making %A Alyssa Glass %A Barbara Grosz %X The growing need for individually motivated agents to work collaboratively to satisfy shared goals has made it increasingly important to design agents that can make intelligent decisions in the context of commitments to group activities. Agents need to reconcile their intentions to do group-related actions with other, conflicting actions. We describe the SPIRE experimental system which allows the process of intention reconciliation in team contexts to be simulated and studied. We define a measure of social consciousness, discuss its incorporation into the SPIRE system, and present several experiments that investigate the interaction in decision-making of measures of group and individual good. In particular, we investigate the effect of infinite and limited time horizons, different task densities, and varying levels of social consciousness on the utility of the group and the individuals it comprises. A key finding is that an intermediate level of social consciousness yields better results than an extreme commitment. We suggest preliminary principles for designers of collaborative agents based on the results.
%X tr-11-99.ps.gz %R TR-12-99 %D 1999 %T Improving Interactive System Performance using TIPME %A Yasuhiro Endo %X tr-12-99.ps.gz %R TR-13-99 %D 1999 %T A Resource Management Framework for Central Servers %A David G. Sullivan %A Margo I. Seltzer %X Proportional-share resource management is becoming increasingly important in today's computing environments. In particular, the growing use of the computational resources of central service providers argues for a proportional-share approach that allows clients to obtain resource shares that reflect their relative importance. In such environments, clients must be isolated from one another to prevent the activities of one client from impinging on the resource rights of others. However, such isolation limits the flexibility with which resource allocations can be modified to reflect the actual needs of clients. We present extensions to the lottery-scheduling resource-management framework that increase its flexibility while preserving its ability to provide secure isolation. To demonstrate how this extended framework safely overcomes the limits imposed by existing proportional-share schemes, we have implemented a prototype system that uses the framework to manage CPU time, physical memory, and disk bandwidth. We present the results of experiments that evaluate the prototype, and we show that our framework enables clients of central servers to achieve significant improvements in performance. %X tr-13-99.ps.gz %R TR-14-99 %D 1999 %T Operating System Support for Multi-User, Remote, Graphical Interaction %A Alexander Ya-Li Wong %A Margo Seltzer %X The rising popularity of thin client computing and multi-user, remote, graphical interaction brings back to the fore a range of long-dormant operating system research issues and introduces a number of new directions. \par This paper investigates the impact of operating system design on the performance of thin client service.
We contend that the key performance metric for this type of system is user-perceived latency and give a structured approach for investigating operating system design with this criterion in mind. \par In particular, we apply our approach to a quantitative comparison and analysis of Windows NT, Terminal Server Edition (TSE) and Linux with the X Window System, two popular implementations of thin client service. \par We find that the processor and memory scheduling algorithms in both operating systems are not tuned for thin client service. Under heavy CPU and memory load, we observed user-perceived latencies of up to 100 times beyond the threshold of perception and even in the idle state these systems induce unnecessary latency. TSE performs particularly poorly despite scheduler modifications to improve interactive responsiveness. We also show that TSE's network protocol outperforms X by up to six times, and also makes use of a bitmap cache which is essential for handling dynamic elements of modern user interfaces and can reduce network load in these cases by up to 2000%. %X tr-14-99.ps.gz %R TR-01-00 %D 2000 %T Selecting Closest Vectors Through Randomization %A Carl Bosley %A Michael O. Rabin %X We consider the problem of finding the closest vectors to a given vector in a large set of vectors, and propose a randomized solution. The method has applications in Automatic Target Recognition (ATR), Web Information Retrieval, and Data Mining. %X tr-01-00.ps.gz %R TR-02-00 %D 2000 %T The Write-Ahead File System: Integrating Kernel and Application Logging %A Chris Stein %X REPORT WITHDRAWN %R TR-03-00 %D 2000 %T Using Multiple Hash Functions to Improve IP Lookups %A Michael Mitzenmacher %A Andrei Broder %X High performance Internet routers require a mechanism for very efficient IP address look-ups. Some techniques used to this end, such as binary search on levels, need to construct quickly a good hash table for the appropriate IP prefixes.
In this paper we describe an approach for obtaining good hash tables based on using multiple hashes of each input key (which is an IP address). The methods we describe are fast, simple, scalable, parallelizable, and flexible. In particular, in instances where the goal is to have one hash bucket fit into a cache line, using multiple hashes proves extremely suitable. We provide a general analysis of this hashing technique and specifically discuss its application to binary search on levels with prefix expansion. %X tr-03-00.ps.gz %R TR-04-00 %D 2000 %T Quantum versus Classical Learnability %A Rocco Servedio %X This paper studies fundamental questions in computational learning theory from a quantum computation perspective. We consider quantum versions of two well-studied classical learning models: Angluin's model of exact learning from membership queries and Valiant's Probably Approximately Correct (PAC) model of learning from random examples. We give positive and negative results for quantum versus classical learnability. For each of the two learning models described above, we show that any concept class is information-theoretically learnable from polynomially many quantum examples if and only if it is information-theoretically learnable from polynomially many classical examples. In contrast to this information-theoretic equivalence between quantum and classical learnability, though, we observe that a separation does exist between {\em efficient} quantum and classical learnability. For both the model of exact learning from membership queries and the PAC model, we show that under a widely held computational hardness assumption for classical computation (the intractability of factoring), there is a concept class which is polynomial-time learnable in the quantum version but not in the classical version of the model.
%X tr-04-00.ps.gz %R TR-05-00 %D 2000 %T Multi-Domain Sandboxing: An Overview %A Robert Fischer %X In today's computing world, computer code is most often developed on one computer and run on another. Code is increasingly downloaded and run on a casual basis, as the line between code and data is blurred and executable code is found in web pages, spreadsheets, word processor documents, etc. \par Not having the knowledge or resources to verify the lack of malicious intent of that code, the user must rely on hearsay and technological solutions to ensure that casually downloaded code does not damage the user's computer or steal data. \par Building on the past concepts of sandboxing and multi-level security, we propose multi-domain sandboxing. This security system allows programs more flexibility than traditional sandboxing, while preventing them from taking malicious actions. We propose applications of this new technology to the web, increasing the functionality and security possible in web applications. %X tr-05-00.ps.gz %R TR-06-00 %D 2000 %T Estimating Resemblance of MIDI Documents %A Michael Mitzenmacher %A Sean Owen %X Search engines often employ techniques for determining syntactic similarity of Web pages. Such a tool allows them to avoid returning multiple copies of essentially the same page when a user makes a query. Here we describe our experience extending these techniques to MIDI music files. The music domain requires modification to cope with problems introduced in the musical setting, such as polyphony. Our experience suggests that when used properly these techniques prove useful for determining duplicates and clustering databases in the musical setting as well. %X tr-06-00.ps.gz %R TR-07-00 %D 2000 %T On the Hardness of Finding Optimal Multiple Preset Dictionaries %A Michael Mitzenmacher %X Preset dictionaries for Huffman codes are used effectively in fax transmission and JPEG encoding.
A natural extension is to allow multiple preset dictionaries instead of just one. We show, however, that finding optimal multiple preset dictionaries for Huffman and LZ77-based compression schemes is NP-hard. %X tr-07-00.ps.gz %R TR-08-00 %D 2000 %T Towards Compressing Web Graphs %A Micah Adler %A Michael Mitzenmacher %X In this paper, we consider the problem of compressing graphs of the link structure of the World Wide Web. We provide efficient algorithms for such compression that are motivated by recently proposed random graph models for describing the Web. The algorithms are based on reducing the compression problem to the problem of finding a minimum spanning tree in a directed graph related to the original link graph. The performance of the algorithms on graphs generated by the random graph models suggests that by taking advantage of the link structure of the Web, one may achieve significantly better compression than natural Huffman-based schemes. We also provide hardness results demonstrating limitations on natural extensions of our approach. %X tr-08-00.ps.gz %R TR-01-01 %D 2001 %T Communication Timestamps for Filesystem Synchronization %A Russ Cox %A William Josephson %X The problem of detecting various kinds of update conflicts in file system synchronization following a network partition is well-known. All systems of which we are aware use the version vectors of Parker et al. These require O(R*F) storage space for F files shared among R replicas. We propose a number of different methods, the most space-efficient of which uses O(R*F) space in the worst case, but O(R+F) in the expected case. \par To gain experience with the various methods, we implemented a file synchronization tool called Tra. Based on this experience, we discuss the advantages and disadvantages of each particular method. \par Tra itself turns out to be useful for a variety of tasks, including home directory maintenance, operating system installation, and managing offline work. 
We discuss some of these uses. %X tr-01-01.ps.gz %R TR-02-01 %D 2001 %T Incomplete Tree Search using Adaptive Probing %A Wheeler Ruml %X When not enough time is available to fully explore a search tree, different algorithms will visit different leaves. Depth-first search and depth-bounded discrepancy search, for example, make opposite assumptions about the distribution of good leaves. Unfortunately, it is rarely clear a priori which algorithm will be most appropriate for a particular problem. Rather than fixing strong assumptions in advance, we propose an approach in which an algorithm attempts to adjust to the distribution of leaf costs in the tree while exploring it. By sacrificing completeness, such flexible algorithms can exploit information gathered during the search using only weak assumptions. As an example, we show how a simple depth-based additive cost model of the tree can be learned on-line. Empirical analysis using a generic tree search problem shows that adaptive probing is competitive with systematic algorithms on a variety of hard trees and outperforms them when the node-ordering heuristic makes many mistakes. Results on boolean satisfiability and two different representations of number partitioning confirm these observations. Adaptive probing combines the flexibility and robustness of local search with the ability to take advantage of constructive heuristics. %X tr-02-01.ps.gz %R TR-03-01 %D 2001 %T gNarLI: A Practical Approach to Natural Language Interfaces to Databases %A AJ Shankar %A Wing Yung %X Most attempted natural language interfaces to databases take too general an approach, and so either require large amounts of setup time or do not provide adequate natural language support for a given domain. \par Our approach, gNarLI, is a pragmatic one: it focuses on one small, easily- and well-defined domain at a time (say, an Oscars movie database). Domain definitions consist of simple rule-based actions. 
gNarLI provides a flexible pattern-matching preprocessor, an intuitive join processor, and language-independent pronoun support ("Who won best actor in 1957?" ... "Where was he born?"), in addition to considerable freedom in mapping rules to one or more portions of a SQL statement. \par All aspects of the program are fully customizable on a domain basis. Appropriately sized domains can be constructed from scratch in less than 15 hours. The ease and speed of domain construction (requiring no programming skills) make gNarLI adaptable to different areas with little difficulty. \par For small domains, gNarLI works surprisingly well. As domain size increases, though, its join processor and rules-based approach become less successful. %X tr-03-01.ps.gz %R TR-04-01 %D 2001 %T Assigning Features using Additive Clustering %A Wheeler Ruml %X If the promise of computational modeling is to be fully realized in higher-level cognitive domains such as language processing, principled methods must be developed to construct the semantic representations that serve as these models' input and/or output. In this paper, we propose the use of an established formalism from mathematical psychology, additive clustering, as a means of automatically assigning discrete features to objects using only pairwise similarity data. Similar approaches have not been widely adopted in the past, as existing methods for the unsupervised learning of such models do not scale well to large problems. We propose a new algorithm for additive clustering, based on heuristic combinatorial optimization. Through extensive empirical tests on both human and synthetic data, we find that the new algorithm is more effective than previous methods and that it also scales well to larger problems. By making additive clustering practical, we take a significant step toward scaling connectionist models beyond hand-coded examples.
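The additive clustering formalism used in TR-04-01 models the similarity of two objects as the sum of the weights of the discrete features they share, s_ij ≈ Σ_k w_k f_ik f_jk. A minimal sketch of that reconstruction step follows; the feature matrix and weights are illustrative, not drawn from the paper:

```python
def reconstruct(features, weights):
    """Additive clustering (ADCLUS) model: features[i][k] is 1 if object i
    has discrete feature k, weights[k] >= 0 is that feature's salience.
    Predicted similarity of i and j is the total weight of shared features."""
    n = len(features)
    m = len(weights)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s[i][j] = sum(weights[k] * features[i][k] * features[j][k]
                          for k in range(m))
    return s

# three objects, two overlapping feature clusters {0,1} and {1,2}
F = [[1, 0],
     [1, 1],
     [0, 1]]
w = [2.0, 3.0]
S = reconstruct(F, w)
```

Fitting the model is the hard part the paper addresses: given only an observed similarity matrix, search for the binary matrix F and weights w whose reconstruction best matches it.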
%X tr-04-01.ps.gz %R TR-05-01 %D 2001 %T An Algebraic Approach to File Synchronization %A Norman Ramsey %A El\H{o}d Csirmaz %X We present a sound and complete proof system for reasoning about operations on filesystems. The proof system enables us to specify a file-synchronization algorithm that can be combined with several different conflict-resolution policies. By contrast, previous work builds the conflict-resolution policy into the specification, or worse, does not specify the behavior formally. We present several alternatives for conflict resolution, and we address the knotty question of timestamps. %X tr-05-01.ps.gz %R TR-06-01 %D 2001 %T Variations on Random Graph Models for the Web %A Eleni Drinea %A Mihaela Enachescu %A Michael Mitzenmacher %X In this paper, we introduce variations on random graph models for Web-like graphs. We provide new ways of interpreting previous models, introduce non-linear models that extend previous work, and suggest models based on feedback between users and search engines. %X tr-06-01.ps.gz %R TR-07-01 %D 2001 %T Dynamic Models for File Sizes and Double Pareto Distributions %A Michael Mitzenmacher %X In this paper, we introduce and analyze a new generative user model to explain the behavior of file size distributions. Our Recursive Forest File model combines ideas from recent work by Downey with ideas from recent work on random graph models for the Web. Unlike similar previous work, our Recursive Forest File model allows new files to be created and old files to be deleted over time, and our analysis covers problematic issues such as correlation among file sizes. Moreover, our model allows natural variations where files that are copied or modified are more likely to be copied or modified subsequently. \par Previous empirical work suggests that file sizes tend to have a lognormal body but a Pareto tail. 
The Recursive Forest File model explains this behavior, yielding a double Pareto distribution, which has a Pareto tail but a body close to lognormal. We believe the Recursive Forest File model may be useful for describing other power law phenomena in computer systems as well as other fields. %X tr-07-01.ps.gz %R TR-08-01 %D 2001 %T A Brief History of Generative Models for Power Law and Lognormal Distributions %A Michael Mitzenmacher %X Power law distributions are an increasingly common model for computer science applications; for example, they have been used to describe file size distributions and in- and out-degree distributions for the Web and Internet graphs. Recently, the similar lognormal distribution has also been suggested as an appropriate alternative model for file size distributions. In this paper, we briefly survey some of the history of these distributions, focusing on work in other fields. We find that several recently proposed models have antecedents in work from decades ago. We also find that lognormal and power law distributions connect quite naturally, and hence it is not surprising that lognormal distributions arise as a possible alternative to power law distributions. %X tr-08-01.ps.gz %R TR-09-01 %D 2001 %T %A Steven Gortler %X %R TR-10-01 %D 2001 %T Integrating On-Demand Alias Analysis into Schedulers for Advanced Microprocessors %A Robert Costa %X %R TR-11-01 %D 2001 %T Progressive Profiling: A Methodology Based on Profile Propagation and Selective Profile Collection %A Zheng Wang %X In recent years, Profile-Based Optimization (PBO) has become a key technique in program optimization. In PBO, the optimizer uses information gathered during previous program executions to guide the optimization process. Even though PBO has been implemented in many research systems and adopted by some software companies, there has been little research on how to make PBO effective in practice.
\par In today's software industry, one major hurdle in applying PBO is the conflict between the need for high-quality profiles and the lack of time for long profiling runs. For PBO to be effective, the profile needs to be representative of how the users or a particular user runs the program. For many modern applications that are large and interactive, it takes a significant amount of time to collect high-quality profiles. This problem will only become more prominent as application programs grow more complex. A lengthy profiling process is especially impractical in software production environments, where programs are modified and rebuilt almost daily. Without enough time for extensive profiling runs, the benefit from applying PBO is severely limited. This in turn hampers the interest in running PBO and increases the dependency on hand tuning in software development and testing. \par In order to obtain high-quality profiles in a software production environment without lengthening the daily build cycle, we seek to change the current practice where a new profile must be generated from scratch for each new program version. Most of today's profiles are generated for a specific program version and become obsolete once the program changes. We propose progressive profiling, a new profiling methodology that propagates a profile across program changes and re-uses it on the new version. We use static analysis to generate a mapping between two versions of a binary program, then use the mapping to convert an existing profile for the old version so that it applies to the new version. When necessary, additional profile information is collected for part of the new version to augment the propagated profile. Since the additional profile collection is selective, we avoid the high expense of re-generating the entire profile. 
With progressive profiling, we can collect profile information from different generations of a program and build a high-quality profile through accumulation over time, despite frequent revisions in a software production environment. \par We present two different algorithms for matching binary programs for the purpose of profile propagation, and use common application programs to evaluate their effectiveness. We use a set of quantitative metrics to compare propagated profiles with profiles collected directly on the new versions. Our results show that for program builds that are weeks or even months apart, profile propagation can produce profiles that closely resemble directly collected profiles. To understand the potential for time saving, we implement a prototype system for progressive profiling and investigate a number of different system models. We use a case study to demonstrate that by performing progressive profiling over multiple generations of a program, we can save a significant amount of profiling time while sacrificing little profile quality. %X tr-11-01.ps.gz %R TR-12-01 %D 2001 %T Automated translation: generating a code generator %A Lee D. Feigenbaum %X A key problem in retargeting a compiler is to map the compiler's intermediate representation to the target machine's instruction set. \par One method to write such a mapping is to use grammar-like rules to relate a tree-based intermediate representation to an instruction set. A dynamic-programming algorithm finds the least costly instructions to cover a given tree. Work in this family includes Burg, BEG, and twig. The other method, utilized by gcc and VPO, uses a hand-written ``code expander'' which expands intermediate representation into naive code. The naive code is improved via machine-independent optimizations while maintaining it as a sequence of machine instructions.
Because they are inextricably linked to a compiler's intermediate representation, neither of these mappings can be reused for anything other than retargeting one specific compiler. \par Lambda-RTL is a language for specifying the semantics of an instruction set independent of any particular intermediate representation. We analyze the properties of a machine from its Lambda-RTL description, then automatically derive the necessary mapping to a target architecture. By separating such analysis from compilers' intermediate representations, Lambda-RTL in conjunction with our work allows a single machine description to be used to build multiple compilers, along with other tools such as debuggers or emulators. \par Our analysis categorizes a machine's storage locations as special registers, general-purpose registers, or memory. We construct a data-movement graph by determining the most efficient way to move arbitrary values between locations. We use this information at compile time to determine which temporary locations to use for intermediate results of large computations. \par To derive a mapping from an intermediate representation to a target machine, we first assume a compiler-dependent translation from the intermediate representation to register-transfer lists. We discover at compile-compile time how to translate these register-transfer lists to machine code and also which register-transfer lists we can translate. To do this, we observe that values are either constants, fetched from locations, or the results of applying operators to values. Our data-movement graph covers constants and fetched values, while operators require an appropriate instruction to perform the effect of the operator. We search through an instruction set discovering instructions to implement operators via the use of algebraic identities, inverses, and rewrite laws and the introduction of unwanted side effects.
%X tr-12-01.ps.gz %R TR-13-01 %D 2001 %T Instruction-Stream Compression %A Christian James Carrillo %X This thesis presents formal elements of instruction-stream compression. \par We introduce notions of instruction representations, compressors and the general ``patternization'' function for representations to sequences. We further introduce the Lua-ISC language, an implementation of these elements. Instruction-stream compression algorithms are expressed, independently of the target architecture, in Lua-ISC. The language itself handles instruction decoding and encoding, patternization and compression; programs within it are compact and readable. \par We perform experiments in instruction representation using Lua-ISC. Our results indicate that the choice of representation and patternization method affects compressor performance, and suggest that current design methodologies may overlook opportunities in lower-level representations. \par Finally, we discuss four instruction-stream compression algorithms and their expressions in Lua-ISC, two of which are our own. The first exploits inter-program redundancy due to static compilation; the second allows state-based compression techniques to function in a random-access environment by compressing instructions as sets of blocks. %X tr-13-01.ps.gz %R TR-14-01 %D 2001 %T CacheDAFS: User Level Client-Side Caching for the Direct Access File System %A Salimah Addetia %X This thesis focuses on the design, implementation and evaluation of user-level client-side caching for the Direct Access File System (DAFS). DAFS is a high performance file access protocol designed for local file sharing in high-speed, low latency networked environments. \par DAFS operates over memory-to-memory interconnects such as Virtual Interface (VI). VI provides a standard for efficient network communication by moving software overheads into hardware and eliminating the operating system from common data transfers.
While much work has been done on message passing and distributed shared memory in VI-like environments, DAFS is one of the first attempts to extend user-level networking to network file systems. In the environment of high-speed networks with virtual interfaces, software overheads such as data copies and translation, buffer management and context switches become important bottlenecks. The DAFS protocol departs from traditional network file system practices to enhance performance. \par Distributed file systems use client-side caching to improve performance by reducing network traffic, disk traffic and server load. The DAFS client, however, performs no caching. This thesis presents a user-space cache for DAFS called cacheDAFS with a careful design that avoids most bottlenecks in network file system protocols and user-level networking environments. CacheDAFS maintains perfect consistency among DAFS clients using NFSv4-like open delegations. Changes to the DAFS API in order to add caching are minimal and results show that DAFS applications can use cacheDAFS to reap all the standard benefits of caching. %X tr-14-01.ps.gz %R TR-01-02 %D 2002 %T Heuristic Search in Bounded-depth Trees: Best-Leaf-First Search %A Wheeler Ruml %X Many combinatorial optimization and constraint satisfaction problems can be formulated as a search for the best leaf in a tree of bounded depth. When exhaustive enumeration is infeasible, a rational strategy visits leaves in increasing order of predicted cost. Previous systematic algorithms for this setting follow a predetermined search order, making strong implicit assumptions about predicted cost and using problem-specific information inefficiently. We introduce a framework, best-leaf-first search (BLFS), that employs an explicit model of leaf cost. BLFS is complete and visits leaves in an order that efficiently approximates increasing predicted cost. Different algorithms can be derived by incorporating different sources of information into the cost model.
We show how previous algorithms are special cases of BLFS. We also demonstrate how BLFS can derive a problem-specific model during the search itself. Empirical results on Latin square completion, binary CSPs, and number partitioning problems suggest that, even with simple cost models, BLFS yields competitive or superior performance and is more robust than previous methods. BLFS can be seen as a model-based extension of iterative-deepening A*, and thus it unifies search for combinatorial optimization and constraint satisfaction with traditional AI heuristic search for shortest-path problems. %X tr-01-02.ps.gz %R TR-02-02 %D 2002 %T Bounds and Improvements for BiBa Signature Schemes %A Michael Mitzenmacher %A Adrian Perrig %X This paper analyzes and improves the recently proposed bins and balls (BiBa) signature, a new approach for designing signatures from one-way functions without trapdoors. \par We first construct a general framework for signature schemes based on the balls and bins paradigm and propose several new related signature algorithms. The framework also allows us to obtain upper bounds on the security of such signatures. Several of our signature algorithms approach the upper bound. We then show that by changing the framework in a novel manner we can boost the efficiency and security of our signature schemes. We call the resulting mechanism Powerball signatures. Powerball signatures offer greater security and efficiency than previous signature schemes based on one-way functions without trapdoors. %X tr-02-02.ps.gz %R TR-03-02 %D 2002 %T Scaling Filename Queries in a Large-Scale Distributed File System %A Jonathan Ledlie %A Laura Serban %A Dafina Toncheva %X We have examined the tradeoffs in applying regular and Compressed Bloom filters to the name query problem in distributed file systems and developed and tested a novel mechanism for scaling queries as the network grows large.
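The best-leaf-first idea in TR-01-02 above can be illustrated with a minimal sketch: repeated cost-bounded depth-first passes, raising the bound to the smallest pruned cost each time, in the spirit of iterative-deepening A*. The tree interface (`children`, `cost`) is illustrative rather than from the report, and the sketch assumes predicted cost is nondecreasing along any root-to-leaf path:

```python
def blfs_leaves(root, children, cost):
    """Yield leaves in nondecreasing predicted cost via cost-bounded
    DFS passes (a model-based analogue of iterative-deepening A*)."""
    bound = cost(root)
    prev = float("-inf")
    while bound < float("inf"):
        next_bound = float("inf")      # smallest cost pruned this pass
        stack = [root]
        while stack:
            node = stack.pop()
            c = cost(node)
            if c > bound:
                next_bound = min(next_bound, c)
                continue
            kids = children(node)
            if not kids:
                if c > prev:           # leaves <= prev were yielded earlier
                    yield node
            else:
                stack.extend(kids)
        prev, bound = bound, next_bound
```

On a toy tree with leaf costs 1, 2 and 3, the generator emits the leaves in exactly that cost order, re-exploring interior nodes on each pass just as IDA* does.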
Filters greatly reduced query messages when used in Fan's "Summary Cache" for web cache hierarchies, a similar, albeit smaller, searching problem. We have implemented a testbed that models a distributed file system and run experiments that test various configurations of the system to see if Bloom filters could provide the same kind of improvements. In a realistic system, where the chance that a randomly queried node holds the file being searched for is low, we show that filters always provide lower bandwidth/search and faster time/search, as long as the rates of change of the files stored at the nodes are not extremely high relative to the number of searches. In other words, we confirm the intuition that keeping some state about the contents of the rest of the system will aid in searching as long as acquiring this state is not overly costly and it does not expire too quickly. \par The grouping topology we have developed divides n nodes into log(n) groups, each of which has a representative node that aggregates a composite filter for the group. All nodes not in that group use this low-precision filter to weed out whole collections of nodes by probing these filters, only sending a search to be proxied by a member of the group if the probe of the group filter returns positively. Proxied searches are then carried out within a group, where more precise (more bits per file) filters are kept and exchanged between the n/log(n) nodes in a group. Experimental results show that both bandwidth/search and time/search are improved with this novel grouping topology. %X tr-03-02.ps.gz %R TR-04-02 %D 2002 %T %R TR-01-87 %D 1987 %T A Very Simple Construction for Atomic Multiwriter Register %A Ming Li %A Paul M.B. Vitanyi %X This paper introduces a new and conceptually very simple algorithm to implement an atomic {\it n} -reader {\it n} -writer variable directly from atomic 1-reader 1-writer variables, using bounded tags.
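The grouping scheme above relies on two Bloom-filter properties: membership tests can give false positives but never false negatives, and filters built with the same parameters can be combined into a group-level composite by OR-ing their bit arrays. A minimal sketch (the class name and parameters are illustrative, not from the report):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k=4):
        self.m, self.k = m_bits, k
        self.bits = 0                      # bit array packed into one integer

    def _indices(self, item):
        # derive k bit positions from k salted hashes of the item
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for b in self._indices(item):
            self.bits |= 1 << b

    def __contains__(self, item):          # may be a false positive, never a false negative
        return all(self.bits >> b & 1 for b in self._indices(item))

    def union(self, other):
        # composite filter for a group: bitwise OR of the members' filters
        out = BloomFilter(self.m, self.k)
        out.bits = self.bits | other.bits
        return out
```

A group representative would OR together its members' filters exactly as `union` does; a remote node probes the composite first and proxies a search into the group only on a positive result.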
The algorithm is developed top-down from the unbounded tag method in [VA]. This is the first direct such construction, and considerably improves the complexity of all known compound constructions. The algorithm uses new techniques, but its main virtue is that it is {\it conceptually very simple and easily proved correct.} %R TR-02-87 %D 1987 %T Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance. %A Michael O. Rabin %X We develop an Information Dispersal Algorithm (IDA) which breaks a file $F$ of length $L = \mid F \mid$ into $n$ pieces $F_i$, $1 \leq i \leq n$, each of length $\mid F_i \mid = L/m$, so that every $m$ pieces suffice for reconstructing $F$. Dispersal and reconstruction are computationally efficient. The sum of the lengths $\mid F_i \mid$ is $(n/m) \cdot L$. Since $n/m$ can be chosen to be close to $1$, the IDA is space efficient. IDA has numerous applications to secure and reliable storage of information in computer networks and even on single disks, to fault-tolerant and efficient transmission of information in networks, and to communications between processors in parallel computers. For the latter problem we get provably time efficient and highly fault-tolerant routing on the $n$-cube, using just constant size buffers. %R TR-03-87 %D 1987 %T Learning in the Presence of Malicious Errors. %A Michael Kearns %A Ming Li %X We study an extension to the Valiant model of machine learning from examples in which errors may be present in the sample data. We give strong bounds on the rate of error tolerable when the errors are of a ``malicious'' nature, and show that it is crucial to take both positive and negative examples when such errors exist. Quantitative comparisons between the malicious model of errors and a model of uniform random noise introduced by Angluin and Laird are made.
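Rabin's IDA above can be realized with polynomial evaluation over a small prime field: each block of $m$ symbols becomes the coefficients of a degree-$(m-1)$ polynomial, piece $i$ stores the polynomial's value at $x = i$, and any $m$ pieces determine the coefficients again. A toy sketch over GF(257) (function and parameter names are ours; a production scheme would use GF($2^8$) and careful length handling):

```python
P = 257  # prime just above one byte, so field arithmetic is plain modular arithmetic

def ida_encode(data, n, m):
    """Split data into n pieces, any m of which suffice to reconstruct."""
    data = list(data) + [0] * (-len(data) % m)          # pad to a multiple of m
    pieces = {x: [] for x in range(1, n + 1)}
    for j in range(0, len(data), m):
        coeffs = data[j:j + m]                          # one degree-(m-1) polynomial
        for x in pieces:
            y = 0
            for c in reversed(coeffs):                  # Horner evaluation mod P
                y = (y * x + c) % P
            pieces[x].append(y)
    return pieces                                       # each piece has length L/m

def ida_decode(shares, m):
    """Recover the padded data from any m pieces (shares: {x: symbol list})."""
    xs = sorted(shares)[:m]
    out = []
    for j in range(len(shares[xs[0]])):
        # solve the m-by-m Vandermonde system for the coefficients, mod P
        A = [[pow(x, k, P) for k in range(m)] + [shares[x][j]] for x in xs]
        for col in range(m):                            # Gauss-Jordan elimination mod P
            piv = next(r for r in range(col, m) if A[r][col])
            A[col], A[piv] = A[piv], A[col]
            inv = pow(A[col][col], P - 2, P)            # modular inverse via Fermat
            A[col] = [a * inv % P for a in A[col]]
            for r in range(m):
                if r != col and A[r][col]:
                    f = A[r][col]
                    A[r] = [(a - f * b) % P for a, b in zip(A[r], A[col])]
        out.extend(A[k][m] for k in range(m))
    return out
```

Each piece carries one field symbol per block of $m$ data symbols, so total storage is $(n/m) \cdot L$ as in the abstract; the Vandermonde system is solvable because the evaluation points are distinct modulo the prime.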
A greedy heuristic for a new generalization of set covering that is of independent interest is given, with nontrivial applications to learning with errors. We also give a reduction from learning to set cover. %R TR-04-87 %D 1987 %T Two Phase Gossip:\\ Managing Distributed Event Histories. %A Abdelsalam Heddaya %A Meichun Hsu %A William Weihl %X We describe a distributed protocol that operates on replicated data objects---of arbitrary abstract types---represented in terms of the history of object events, rather than in terms of object values. In general, a site that stores a replica of an object has only partial knowledge of the object's event history. We call such a replica an {\it object representative} . The goal of the protocol is to limit the sizes of the histories by checkpointing and discarding old events. We treat separately the three functions of (1) propagating events to sites that do not know about them, (2) checkpointing, and (3) discarding old events. A site rolls forward its checkpoint state as far as it {\it knows} its local version of the object's history to be complete, but it discards old events only as far as it knows that all other sites {\it know} their local histories to be complete. Our protocol propagates events among sites in {\it gossip} messages exchanged in the background in a {\it two phase} manner. Each site maintains several timestamp vectors indicating the extent of its knowledge of other sites' knowledge of events. These timestamp vectors are included in gossip messages and used to decide the extent of global-completeness of each site's local version of the event history. We formally define the correctness criteria for our protocol in terms of the completeness properties of histories. %R TR-05-87 %D 1987 %T An Integrated Toolkit for Operating System Security. %A Michael O. Rabin %A J.D. Tygar %R TR-06-87 %D 1987 %T On the influence of single participant in coin flipping schemes. 
%A Benny Chor %A Mih\'{a}ly Ger\'{e}b-Graus %X We prove that in a one round fair coin flipping scheme with {\it n} participants, either the {\it average} influence of all participants is at least $3/n - o(1/n)$, or there is at least one participant whose influence is $\Omega\left(n^{-5/6}\right)$. %R TR-07-87 %D 1987 %T The Complexity of Parallel Comparison Merging. %A Mih\'{a}ly Ger\'{e}b-Graus %A Danny Krizanc %X We prove a worst-case lower bound of $\Omega(\log \log n)$ for randomized algorithms merging two sorted lists of length $n$ in parallel using $n$ processors on Valiant's parallel computation tree model. We show how to strengthen this result to a lower bound for the expected time taken by any algorithm on the uniform distribution. Finally, bounds are given for the average time required for the problem when the number of processors is less than and greater than $n$. %R TR-08-87 %D 1987 %T The Approximation of the Permanent is Still Open\\ - A Flaw in Broder's Proof - %A Milena Mihail %X In [B] Broder claims to have obtained a polynomial time randomized algorithm for approximating the permanent of dense 0-1 matrices. In this note we point out a major flaw in the analysis of the algorithm: the process C2 defined in [B] as a coupling is in fact {\it not} such. Thus, it remains open whether Broder's algorithm gives satisfactory estimates. %R TR-09-87 %D 1987 %T $k+1$ Heads are Better Than $k$ for PDA's %A Marek Chrobak %A Ming Li %X We prove the following conjecture stated by Harrison and Ibarra in 1968 [HI, p.462]: There are languages accepted by $(k+1)$-head 1-way deterministic pushdown automata ($(k+1)$-DPDA) but not by $k$-head 1-way pushdown automata ($k$-PDA), for every $k$. (Partial solutions for this conjecture can be found in [M1,M2,C].) On the assumption that their conjecture holds, [HI] also derived some other consequences. Now all those consequences become theorems.
For example, the class of languages accepted by $k$-PDA's is not closed under intersection and complementation. Several other interesting consequences also follow: CFL $\not\subseteq \cup_k$ DPDA($k$) and FA(2) $\not\subseteq \cup_k$ DPDA($k$), where DPDA($k$) $= \{ L \mid L$ is accepted by a $k$-DPDA$\}$ and FA(2) $= \{ L \mid L$ is accepted by a 2-head FA$\}$. Our proof is constructive (that is, not based on diagonalization). Before, the ``$k$+1 versus $k$ heads'' problem was solved by diagonalization and translation methods [I2,M2,M3,M4,S] for stronger machines (2-way, etc.), and by traditional counting arguments [S2,IK,YR,M1] for weaker machines ($k$-FA-$k$-head counter machines, etc.). %R TR-10-87 %D 1987 %T Tape versus Queue and Stacks:\\ The Lower Bounds %A Ming Li %A Paul M.B. Vit\'{a}nyi %X Several new optimal or nearly optimal lower bounds are derived on the time needed to simulate queues, stacks (stack = pushdown store) and tapes by one off-line single-head tape unit with one-way input, for both the deterministic case and the nondeterministic case. The techniques rely on algorithmic information theory (Kolmogorov complexity). %R TR-11-87 %D 1987 %T Plans for Discourse %A Barbara J. Grosz %A Candace L. Sidner %R TR-12-87 %D 1987 %T Oblivious Secret Computation %A Donald Beaver %X Recently, methods for secret, distributed, and fault-tolerant computation have been developed. Those techniques show how to ``compile'' a function to produce a protocol in which a distributed system evaluates that function at secret arguments, revealing the value of the function but not the values of the arguments, nor the values in intermediate steps of the computation. We extend these methods in two ways: first, we show that the {\it function} itself need not be revealed. That is, we show how to evaluate a function secretly, without revealing any information about the function itself, other than a bound on the time and space needed to compute it.
This result has fundamental implications for distributed system security and fault-tolerant computing. Secondly, we show that revealing the {\it result} of a secret computation is not necessary, and we give useful applications in which the values of secretly computed functions are never revealed. One such application extends the ideas of oblivious transfer to the distributed environment. Using this extension, we solve classical, two-party oblivious transfer {\it without using cryptography}, unlike all previous solutions. %R TR-13-87 %D 1987 %T Polynomially Sized Boolean Circuits Are\\ Not Learnable %A Donald Beaver %X Polynomial-sized boolean circuits provide a powerful and general framework to describe many naturally occurring concepts. Because of the wide range of functions which can be described or computed by boolean circuits, an algorithm to learn arbitrary polynomial-sized circuits would have broad implications for artificial intelligence and learning theory. We prove, however, a conjecture made by Valiant that, if one-way functions exist, then the class of polynomial-sized boolean circuits is not probably-approximately learnable. We also discuss how this result might be strengthened by eliminating the assumption of one-way functions. %R TR-14-87 %D 1987 %T Oblivious Routing with Limited Buffer Capacity %A Danny Krizanc %X The problem of oblivious routing in fixed connection networks with a limited amount of space available to buffer packets is studied. We show that for an $n$-processor network with a constant number of connections and a constant number of buffers, any deterministic pure source-oblivious strategy realizing all partial permutations requires $\Omega(n)$ time. The consequence of this result for well-known networks is discussed. %R TR-01-88 %D 1988 %T Bounded Time-Stamps %A Amos Israeli %A Ming Li %X Time-stamps are labels which a system adds to its data items.
These labels enable the system to keep track of the temporal precedence relation among its data elements. Traditionally time-stamps are used as unbounded numbers and inevitable overflows cause a loss of this precedence relation. In this paper we develop a theory of {\it bounded time-stamps}. Time-stamp systems are defined and the complexity of their implementation is fully analyzed. This theory gives a general tool for converting time-stamp based protocols to bounded protocols. The power of this theory is demonstrated by a novel, conceptually simple, protocol for a multi-writer atomic register, as well as by proving, for the first time, a non-trivial lower bound for such a register. %R TR-02-88 %D 1988 %T Two Decades of Applied Kolmogorov Complexity %A Ming Li %A Paul M.B. Vitanyi %X This exposition is a survey of elegant and useful applications of Kolmogorov complexity. We distinguish three areas: I) Application of the fact that some strings are compressible. This includes a strong version of G\"{o}del's incompleteness theorem. II) Lower bound arguments which rest on application of the fact that certain strings cannot be compressed at all. Applications range from Turing machines to electronic chips. III) Other issues. For instance, the foundations of Probability Theory, a priori probability, and resource-bounded Kolmogorov complexity. Applications range from NP-completeness to inductive inference in Artificial Intelligence. %R TR-03-88 %D 1988 %T Hybrid Beam-Ray Tracing %A Joe Marks %A Robert J. Walsh %A Mark Friedell %X Ray tracing is the most accurate, but unfortunately also the most expensive of current rendering techniques. Beam tracing, suggested by Heckbert and Hanrahan, is a generalization of ray tracing that exploits area coherence to trace multiple rays in parallel. We present a new approach to beam tracing derived from the area-subdivision technique of Warnock's hidden-surface algorithm.
We use this approach to beam tracing as the basis of a hybrid beam-ray tracing algorithm. The algorithm uses beam tracing to render large coherent regions of the image, and ray tracing to render complex regions. A heuristic decision procedure chooses between beam tracing and ray tracing a given area on the basis of estimated rendering costs. Experimental results show that our hybrid algorithm is more efficient than either ray tracing or beam tracing used alone. %R TR-04-88 %D 1988 %T Unsafe Operations in B-trees %A Bin Zhang %A Meichun Hsu %X A simple mathematical model for analyzing the dynamics of a B-tree node is presented. From the solution of the model, it is shown that the simple technique of allowing a B-tree node to be slightly less than half full can significantly reduce the rate of split, merge and borrow operations. We call split, merge, borrow and balance operations {\it unsafe} operations in this paper. In a multi-user environment, a lower unsafe-operation rate implies less blocking and higher throughput, even when tailored concurrency control algorithms (e.g., that proposed by [Lehman \& Yao]) are used. A lower unsafe-operation rate also means a longer lifetime for an optimally initialized B-tree (e.g., a compact B-tree). It is in general useful to have an analytical model which can predict the rate of unsafe operations in a dynamic data structure, not only for comparing the behavior of variations of B-trees, but also for characterizing workload for performance evaluation of different concurrency control algorithms for such data structures. The model presented in this paper represents a starting point in this direction.
%R TR-05-88 %D 1988 %T The Mean Value Approach to Performance Evaluation of Cautious Waiting %A Meichun Hsu %A Bin Zhang %X We propose a deadlock-free locking-based concurrency control algorithm, called {\it cautious waiting}, which allows for a limited form of waiting, and present an analytical solution to its performance evaluation based on the mean-value approach. The proposed algorithm is simple to implement. From the modeling point of view, we are able to track both the restart rate and the blocking rate properly, and we show that to solve the model we only need to find the root of a polynomial. From the performance point of view, the analytical tools developed enable us to see that the cautious waiting algorithm manages to achieve a {\it delicate balance} between restart and blocking, and is therefore able to perform better (i.e., has higher throughput) than both the no waiting and the general waiting algorithms. This result is encouraging for other deadlock avoidance-oriented locking algorithms. %R TR-06-88 %D 1988 %T An Architecture-Independent Model for Parallel Programming %A Gary Wayne Sabot %X This dissertation describes a fundamentally new way of looking at parallel programming called the paralation model. The paralation model consists of a new data structure and a small, irreducible set of operators. The model can be combined with any base language to produce a concrete parallel language. One important goal of the model is ease of use for general problem solving. The model must provide tools that address the problems of the programmer. Equally important, languages based upon the model must be easy to compile, in a transparent and efficient manner, for a broad range of target computer architectures (for example, Multiple Instruction Multiple Data (MIMD) or Single Instruction Multiple Data (SIMD) processors; bus-based, butterfly, or grid interconnect; and so on).
Several compilers based on this model exist, including one that produces code for the 65,536 processor Connection Machine. The dissertation includes a short (two pages long) operational semantics for a paralation language based on Common Lisp. By executing this code, the interested reader can experiment with the paralation constructs. The dissertation (as well as a disk containing a complete implementation of Paralation Lisp) is available from MIT Press as "The Paralation Model: Architecture-Independent Parallel Programming". %R TR-07-88 %D 1988 %T Managing Databases in Distributed Virtual Memory %A Meichun Hsu %A Va-On Tam %R TR-08-88 %D 1988 %T Modeling Performance Impact of Hot Spots %A Meichun Hsu %A Bin Zhang %X An important factor that may affect the performance of the concurrency control algorithm of a database system is the skewness of the distribution of accesses to data granules (i.e., non-uniform access pattern). In this paper, we examine the impact of hot spots analytically by employing the mean-value approach to performance modeling of concurrency control algorithms. The impact of non-uniform accesses is analyzed using a new principle of {\it data flow balance}. Using data flow balance we generalize the b-c access pattern (i.e., b\% of accesses go directly to c\% of data granules) to arbitrary distributions, and solve the non-uniform access model for two classes of two phase locking algorithms. We also show that the database reduction factor employed previously in the literature to handle non-uniform access is in fact an upper bound. %R TR-09-88 %D 1988 %T Analytical Preprocessing for Likelihood Ratio Methods %A Bin Zhang %X An analytical preprocessing (APP) method for the likelihood ratio method, an algorithm for finding the derivatives of a performance function of Discrete Event Dynamical Systems (DEDS), is developed.
The convergence of the integral used in the likelihood ratio method and the legality of interchanging $\frac{\partial}{\partial\mu}$ with $\int$ are proved for a large class of sample performance functions. %R TR-10-88 %D 1988 %T Exemplar-based Learning: Theory and Implementation %A Steven Salzberg %X Exemplar-based learning is a theory of inductive learning in which learning is accomplished by storing objects in Euclidean $n$-space, $E^n$, as hyper-rectangles. In contrast to the usual inductive learning paradigm, which learns by replacing symbolic formulae by more general formulae, the generalization process for exemplar-based learning modifies hyper-rectangles by growing and reshaping them in a well-defined fashion. Some advantages and disadvantages of the theory are described, and the theory is compared to other inductive learning theories. An implementation has been tested on several different domains, four of which are presented here: predicting the recurrence of breast cancer, classifying iris flowers, predicting stock prices, and predicting survival times for heart attack patients. The robust learning behavior and easily understandable representations produced by the implementation support the claim that exemplar-based learning offers advantages over other learning models for some classes of problems. %R TR-11-88 %D 1988 %T An Algorithm for Self Diagnosis in Distributed Systems %A Azer A. Bestavros %X In this paper, based on a powerful diagnostic model, we present a distributed diagnosis algorithm which makes a distributed system capable of repairing (replacing) faulty units while operating in a gracefully degraded fault-tolerant manner. The algorithm allows the on-line reentry of repaired (replaced) units, and deals effectively with synchronization and time-stamping problems. We define an active set to be a set of fault-free processing elements that agree on a unified diagnosis.
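Salzberg's exemplar-based scheme above stores hyper-rectangles and generalizes by growing them toward new points; a stripped-down nearest-hyperrectangle sketch (our own simplification, omitting the weighting and reshaping machinery of the actual system):

```python
def rect_distance(point, rect):
    """Euclidean distance from a point to an axis-aligned rectangle
    (zero when the point lies inside it)."""
    lo, hi = rect
    return sum(max(l - x, 0.0, x - h) ** 2
               for x, l, h in zip(point, lo, hi)) ** 0.5

class ExemplarClassifier:
    def __init__(self):
        self.rects = []                    # list of (lo, hi, label)

    def train(self, point, label):
        if self.rects:
            nearest = min(self.rects, key=lambda r: rect_distance(point, r[:2]))
            if nearest[2] == label:
                # generalize: grow the nearest same-label rectangle to cover the point
                lo = tuple(map(min, nearest[0], point))
                hi = tuple(map(max, nearest[1], point))
                self.rects.remove(nearest)
                self.rects.append((lo, hi, label))
                return
        # no exemplar yet, or nearest has the wrong label: store a point exemplar
        self.rects.append((tuple(point), tuple(point), label))

    def predict(self, point):
        return min(self.rects, key=lambda r: rect_distance(point, r[:2]))[2]
```

Training two same-label points merges them into one rectangle covering both, while a differently labeled point becomes its own degenerate (point-sized) exemplar.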
The algorithm we propose enables each healthy processor to reliably identify the largest possible active set to which it belongs. If each processor refrains from further dealings with units not in its active set, then logical reconfiguration can be achieved. Concurrent with this on-line fault isolation, analyzers in the system carry out detailed diagnosis to locate the faulty units, and upon a proper report, the associated controllers can perform the required repair/replace. Later, recovering units will be allowed to reenter the system. The algorithm is shown to be robust in that it guarantees the survival of any possible active set. In particular, if there are no more than {\it t} faults in the system, the algorithm guarantees that all fault-free processors will eventually identify the fault pattern, provided that the system is {\it t self-diagnosable without repair}. Furthermore, if the system is {\it t self-diagnosable with repair}, then the algorithm guarantees that at least one controller will be able to repair/replace (at least) one faulty unit. %R TR-12-88 %D 1988 %T On Coupling and the Approximation of the Permanent %A Milena Mihail %X We discuss in detail the limitations of the coupling technique for a Markov Chain on a population with non-trivial combinatorial structure, namely perfect and near-perfect matchings of a dense bipartite graph. %R TR-13-88 %D 1988 %T Distributed Non-Cryptographic Oblivious Transfer with Constant Rounds Secret Function Evaluation %A Donald Beaver %X We develop tools for distributively and secretly computing the {\it parity} of a hidden value in a {\it constant number of rounds} without revealing the secret argument or result, improving the previous $O(\log n)$ round requirement, and we apply them to a {\it generalization of Oblivious Transfer} to the distributed environment.
Oblivious Transfer, in which Alice reveals a bit to Bob with 50-50 probability, but without knowing whether or not Bob received it, has found diverse applications. In the case where Alice and Bob are alone, and they cannot trust each other, all known solutions require extensive cryptography. If, however, there exist other trustworthy parties around, we show that oblivious transfer can be performed without any cryptography, even if nobody knows who the trustworthy parties are. It suffices that at most a third or a half of the participants are dishonest. Using previously developed methods for distributed secret computation, we give a solution requiring $O(\log n)$ rounds of interaction. We then give novel techniques to reduce the number of rounds to a constant. These new techniques include computing a random 0-1 secret, taking the multiplicative inverse of a secret, and normalizing a secret (without revealing the result) in a constant expected number of rounds. Such tools allow arbitrary functions over a polynomially sized domain of arguments to be evaluated quickly and efficiently. The methods developed for this paper provide a basis for solutions to other problems in distributed secret computation. %R TR-14-88 %D 1988 %T Learning Boolean Formulae or Finite Automata is as Hard as Factoring %A Michael Kearns %A Leslie G. Valiant %X We prove that the problem of inferring from examples Boolean formulae or deterministic finite automata is computationally as difficult as deciding quadratic residuosity, inverting the RSA encryption function and factoring Blum integers (composite numbers $p \cdot q$ where $p$ and $q$ are primes, both congruent to 3 {\it mod} 4). These results are for the distribution-free model of learning.
They hold even when the inference task is that of deriving a probabilistic polynomial time classification algorithm that predicts the correct value of a random input with probability at least $\frac{1}{2} + \frac{1}{p(s)}$, where $s$ is the size of the Boolean formula or deterministic finite automaton, and $p$ is any polynomial. %R TR-15-88 %D 1988 %T Merging and Routing on Parallel Models of Computation %A Daniel David Krizanc %X The term ``parallel computation'' encompasses a large and diverse set of problems and theory. In this thesis we use it to refer to any computation which consists of a large collection of tightly coupled synchronized processors working together to solve a terminating computational problem. Models of such computation capture to varying degrees properties of existing and proposed parallel machines. Our main concerns here are with how the models deal with the communication between processors and what effect the introduction of randomness has on the models. The main results of the thesis are: (a) a lower bound on the average time required by a parallel comparison tree (PCT) to merge two lists; (b) a tight tradeoff between the running time, the probability of failure and the number of independent random bits used by an oblivious routing strategy for a fixed connection network; (c) a similar tradeoff for the problem of finding the median on the PCT model; and (d) a tradeoff between the amount of storage required by the nodes of a network and the deterministic time complexity of a class of oblivious routing strategies for the network. %R TR-16-88 %D 1988 %T Parallel Bin Packing Using First-fit and K-delayed Best-fit Heuristics %A Azer Bestavros %A William McKeeman %X In this paper, we present and contrast simulation results for the {\it Bin Packing} problem. We used the Connection Machine to simulate the asymptotic behavior of a variety of packing heuristics. Several parallel algorithms are considered, presented and contrasted in the paper.
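For reference, the sequential first-fit heuristic that such simulations parallelize is tiny (a sketch of the classical heuristic only; the report's K-delayed best-fit variant is not shown):

```python
def first_fit(items, capacity=1.0):
    """Place each item into the first open bin with room; open a new bin if none fits."""
    bins = []
    for size in items:
        for b in bins:
            if sum(b) + size <= capacity + 1e-9:   # small tolerance for float sums
                b.append(size)
                break
        else:                                      # no existing bin had room
            bins.append([size])
    return bins
```

The inner scan over open bins is the seemingly serial step the abstract refers to; the data-parallel simulation evaluates many such placements at once across the Connection Machine's processors.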
The seemingly serial nature of the bin packing simulation has prevented previous experiments from going beyond sizes of several thousand bins. We show that by adopting a fairly simple data parallel algorithm a speedup by a factor of $N$ over a straightforward serial implementation is possible. Sizes of up to hundreds of thousands of bins have been simulated for different parameters and heuristics. %R TR-17-88 %D 1988 %T Ray Tracing for Massively-Parallel Machines %A Robert J. Walsh %X Proper data-type modeling for ray tracing on massively-parallel machines is presented as a basis for parallel graphics algorithm construction. A spatial enumeration algorithm is developed, discussed, modeled, and evaluated. The spatial enumeration evaluation is used to motivate an octree algorithm appropriate for parallel machines. These algorithms are suitable for any graphics problem which can be solved using rays, including radiosity. While the focus is computer graphics, general-purpose, massively-parallel machine architecture is discussed sufficiently to support the analysis and to generate some ideas for machine design. %R TR-18-88 %D 1988 %T A New Scan-Line Algorithm for Massively-Parallel Machines %A Robert J. Walsh %X This paper describes how to rapidly generate a synthetic image using a new rendering algorithm for general-purpose, massively-parallel architectures. The presentation shows how algorithm design decisions were made and how to customize the algorithm to a particular architecture. Alternative algorithms are analyzed qualitatively and quantitatively to demonstrate the superiority of the new algorithm. %R TR-19-88 %D 1988 %T Secure Multiparty Protocols Tolerating Half Faulty Processors %A Donald Beaver %X We show how any function of $n$ inputs can be evaluated by a complete network of $n$ processors, revealing no information other than the result of the function, and tolerating up to $t$ maliciously faulty parties for $2t < n$.
We demonstrate a resilient method to multiply secret values without using cryptography. The crux of our method is a new, non-cryptographic zero-knowledge technique by which a single party can secretly share values $a_1, \cdots, a_m$ along with another secret $B = P(a_1, \cdots, a_m)$, where $P$ is an arbitrary function; and by which the party can prove to all other parties that $B = P(a_1, \cdots, a_m)$, without revealing $B$ or any other information. Using this technique we give a protocol for multiparty private computation that improves the bound established by previous results, which required $3t < n$. Our protocols allow an exponentially small chance of error, but are provably optimal in their resilience against Byzantine faults. %R TR-20-88 %D 1988 %T Managing Event-based Replication for Abstract Data Types in Distributed Systems %A Abdelsalam Abdelhamid Heddaya %X Data replication enhances the availability of data in distributed systems. This thesis deals with the management of a particular representation of replicated data objects that belong to Abstract Data Types (ADT). Traditionally, replicated data objects have been stored in terms of their {\em states}, or {\em values}. In this thesis, I argue for the viability of state transition {\em histories}, or {\em logs}, as a more suitable storage representation for abstract data types in distributed computing environments. We present two main contributions: a new protocol for reducing message and storage requirements of histories, and a novel reconfiguration and recovery method. In the first protocol, we introduce the notion of {\em two phase gossip} as the primary mechanism for managing distributed replicated event histories. We focus our second protocol for reconfiguration and recovery on enhancing the availability of distributed objects in the face of sequences of failures.
Additionally, our reconfiguration protocol supports system administration functions related to the storage of distributed objects. In combination, the two protocols that we propose demonstrate the viability and desirability of the distributed representation of an ADT object as a history of the state transitions that the data object undergoes, rather than as the value or the sequence of values that it assumes. %R TR-01-89 %D 1989 %T A Parallel Algorithm for Eliminating Cycles in Undirected Graphs %A Philip Klein %A Clifford Stein %X We give an efficient parallel algorithm for finding a maximal set of edge-disjoint cycles in an undirected graph. The algorithm can be generalized to handle a weighted version of the problem. %R TR-02-89 %D 1989 %T On the time-space complexity of reachability queries for preprocessed graphs %A Lisa Hellerstein %A Philip Klein %A Robert Wilber %X How much can preprocessing help in solving graph problems? In this paper, we consider the problem of reachability in a directed bipartite graph, and propose a model for evaluating the usefulness of preprocessing in solving this problem. We give tight bounds for restricted versions of the model that suggest that preprocessing is of limited utility. %R TR-03-89 %D 1989 %T On the Magnification of 0-1 Polytopes %A Milena Mihail %A Umesh Vazirani %R TR-04-89 %D 1989 %T Conductance and Convergence of Markov Chains %A Milena Mihail %A Umesh Vazirani %X Let $P$ be an irreducible and strongly aperiodic (i.e., $p_{ii} \geq \frac{1}{2}\ \forall i$) stochastic matrix. We obtain non-asymptotic bounds for the convergence rate of a Markov chain with transition matrix $P$ in terms of the {\it conductance} of $P$. These results have so far been obtained only for time-reversible Markov chains via partially linear algebraic arguments. Our proofs eliminate the linear algebra and therefore naturally extend to general Markov chains.
The key new idea is to view the action of a strongly aperiodic stochastic matrix as a weighted averaging along the edges of the {\it underlying graph} of $P$. Our results suggest that the conductance (rather than the second largest eigenvalue) best quantifies the rate of convergence of strongly aperiodic Markov chains. %R TR-05-89 %D 1989 %T Transaction Synchronization in Distributed Shared Virtual Memory Systems %A Meichum Hsu %A Va-On Tam %X Distributed shared virtual memory (DSVM) is an abstraction which integrates the memory space of different machines in a local area network environment into a single logical entity. The algorithm responsible for maintaining this {\it virtually} shared image is called the {\it memory coherence algorithm}. In this paper, we study the interplay between memory coherence and {\it process synchronization}. In particular, we devise two-phase-locking-based algorithms in a distributed system under two scenarios: {\it with} and {\it without} an underlying memory coherence system. We compare the performance of the two algorithms using simulation, and argue that significant performance gains can potentially result from bypassing memory coherence and supporting process synchronization directly on distributed memory. We also study the role of {\it optimistic} algorithms in the context of DSVM, and show that an optimistic policy appears promising under the scenarios studied. %R TR-06-89 %D 1989 %T A VLSI Chip for the Real-time Information Dispersal and Retrieval for Security and Fault-Tolerance %A Azer Bestavros %X In this paper, we describe SETH, a hardwired implementation of the recently proposed ``Information Dispersal Algorithm'' (IDA). SETH allows the real-time dispersal of information into different pieces as well as the retrieval of the original information from the available pieces.
SETH accepts a stream of data and a set of ``keys'' and produces the required streams of dispersed data to be stored on (or communicated to) the different sinks. The chip can also accept the streams of data from the different sinks, along with the necessary controls and keys, so as to reconstruct the original information. We begin this paper by introducing the Information Dispersal Algorithm and giving an overview of SETH's operation. The different functions are described and system block diagrams of varying levels of detail are presented. Next, we present an implementation of SETH in scalable CMOS technology, fabricated using the MOSIS 3-micron process. We conclude the paper with potential applications and extensions of SETH. In particular, we emphasize the promise of the Information Dispersal Algorithm in the design of I/O subsystems, Redundant Array of Inexpensive Disks (RAID) systems, and reliable communication and routing in distributed/parallel systems. SETH demonstrates that using IDA in these applications is feasible. %R TR-07-89 %D 1989 %T General Purpose Parallel Architectures %A L.G. Valiant %X The possibilities for efficient general purpose parallel computers are examined. First some network models are reviewed. It is then shown how these networks can efficiently implement some basic message routing functions. Finally, various high level models of parallel computation are described and it is shown that the above routing functions are sufficient to implement them efficiently. %R TR-08-89 %D 1989 %T Bulk-Synchronous Parallel Computers %A L.G. Valiant %X We attribute the success of the von Neumann model of sequential computation to the fact that it is an efficient bridge between software and hardware. On the one hand, high level languages can be efficiently compiled on to this model. On the other, it can be efficiently implemented in hardware in current technology.
We argue that an analogous bridge between software and hardware is required for parallel computation if it is to become as widely used. We introduce the bulk-synchronous parallel (BSP) model as a candidate for this role. We justify this suggestion by giving a number of results that quantify its efficiency both in implementing some high level language features and in being implemented in hardware. %R TR-09-89 %D 1989 %T Scheduling Initialization Equations for Parallel Execution %A Lilei Chen %R TR-10-89 %D 1989 %T Hiding Information from Several Oracles %A Donald Beaver %A Joan Feigenbaum %X Abadi, Feigenbaum, and Kilian have considered {\it computations with encrypted data} [AFK]. Let $f$ be a function that is not provably computable in randomized polynomial time; randomized polynomial-time machine A wants to query an oracle B for $f$ to obtain $f(x)$, without telling B exactly what $x$ is. Several well-known random-self-reducible functions, such as discrete logarithm and quadratic residuosity, are {\it encryptable} in this sense; that is, A can query B about an instance while hiding some significant information about the instance. It is shown in [AFK] that, if $f$ is an NP-hard function, A cannot query B while keeping secret all but the size of the instance, assuming that the polynomial hierarchy does not collapse. This negative result holds even if the oracle B has ``infinite'' computational power. Here we show that A {\it can} query $n$ oracles B$_1$, $\ldots$, B$_n$, where $n=|x|$, and obtain $f(x)$ while hiding all but $n$ from each B$_i$, {\it for any boolean function} $f$. This answers a question due to Rivest that was left open in [AFK]. Our proof adapts techniques developed by Ben-Or, Goldwasser, and Wigderson and by Chaum, Cr\'epeau, and Damg\aa rd for using Shamir's {\it secret-sharing} scheme to hide information about inputs to distributed computations [BGW], [CCD], [S].
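The protocols of TR-10-89 (and the multiparty results above) rest on Shamir's secret-sharing scheme: a secret becomes the constant term of a random polynomial of degree $t-1$ over a prime field, and any $t$ evaluations recover it by Lagrange interpolation while fewer than $t$ reveal nothing. A minimal illustrative sketch, not taken from any of the reports (the prime modulus and function names are our own choices):

```python
import random

P = 2**31 - 1  # a Mersenne prime; all arithmetic is in GF(P)

def share(secret, n, t):
    """Split `secret` into n shares; any t of them reconstruct it."""
    # Random polynomial of degree t-1 with constant term = secret.
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    # Share i is the polynomial evaluated at x = i (i = 1..n).
    return [(i, sum(c * pow(i, k, P) for k, c in enumerate(coeffs)) % P)
            for i in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        # pow(den, P-2, P) is the modular inverse (Fermat's little theorem).
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = share(12345, n=5, t=3)
assert reconstruct(shares[:3]) == 12345   # any 3 of the 5 shares suffice
assert reconstruct(shares[1:4]) == 12345
```

Any $t-1$ shares are consistent with every possible secret, which is the information-hiding property the oracle and multiparty protocols exploit.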
%R TR-11-89 %D 1989 %T Perfect Privacy for Two-Party Protocols %A Donald Beaver %X We examine the problem of computing functions in a distributed manner based on private inputs, revealing the value of a function without revealing any additional information about the inputs. A function $f(x_1,\dots,x_n)$ is $t$-private if there is a protocol whereby $n$ parties, each holding an input value, can compute $f,$ and no subset of $t$ or fewer parties gains any information other than what is computable from $f(x_1,\dots,x_n).$ The class of $t$-private {\em boolean} functions for $t \geq \lceil \frac{n}{2} \rceil$ was described in [CK89]. We give a characterization of 1-private functions for two parties, without the restriction to the boolean case. We also examine a restricted form of private computation in the $n$-party case, and show that addition is the only privately computable function in that model. %R TR-12-89 %D 1989 %T The Input Output Real-Time Automaton: A model for real-time parallel computation %A Azer A. Bestavros %X In this paper, we propose a unified framework for embedded real-time systems in which it is possible to see the relation between issues of specification, implementation, correctness, and performance. The framework we suggest is based on the IORTA {\it (Input-Output Real-Time Automata)} model, which is an extension of the previously introduced IOA model. An IORTA is an abstraction that encapsulates a system task. An embedded system is viewed as a set of interacting IORTAs. IORTAs communicate with each other and with the external environment using {\it signals}. A signal carries a sequence of {\it events}, where an event represents an instantiation of an {\it action} at a specific point in time. Actions can be generated by either the environment or the IORTAs. Each IORTA has a {\it state}. The state of an IORTA is observable and can only be changed by local {\it computations}.
Computations are triggered by actions and have to be scheduled to meet specific timing constraints. IORTAs can be {\it composed} together to form higher level IORTAs. A specification of an IORTA is a description of its behavior (i.e., how it reacts to stimuli from the environment). An IORTA is said to {\it implement} another IORTA if it is impossible to differentiate between their external behaviors. This is the primary tool that is used to verify that an implementation meets the required specification. %R TR-13-89 %D 1989 %T The Computational Complexity of Machine Learning %A Michael J. Kearns %X This thesis is a study of the computational complexity of machine learning from examples in the distribution-free model introduced by L.G. Valiant. In the distribution-free model, a learning algorithm receives positive and negative examples of an unknown target set (or {\it concept}) that is chosen from some known class of sets ({\it concept class}). These examples are generated randomly according to a fixed but unknown probability distribution representing Nature, and the goal of the learning algorithm is to infer a hypothesis concept that closely approximates the target concept with respect to the unknown distribution. This thesis is concerned with proving theorems about learning in this formal mathematical model. We are interested in the phenomenon of {\it efficient} learning in the distribution-free model, in the standard polynomial-time sense. Our results include general tools for determining the polynomial-time learnability of a concept class, an extensive study of efficient learning when errors are present in the examples, and lower bounds on the number of examples required for learning in our model. A centerpiece of the thesis is a series of results demonstrating the computational difficulty of learning a number of well-studied concept classes.
These results are obtained by reducing some apparently hard number-theoretic problems from cryptography to the learning problems. The hard-to-learn concept classes include the sets represented by Boolean formulae, deterministic finite automata, and a simplified form of neural networks. We also give algorithms for learning powerful concept classes under the uniform distribution, and give equivalences between natural models of efficient learnability. This thesis also includes detailed definitions and motivation for the distribution-free model, a chapter discussing past research in this model and related models, and a short list of important open problems. %R TR-14-89 %D 1989 %T Learning with Nested Generalized Exemplars %A Steven Lloyd Salzberg %X This thesis presents a theory of learning called nested generalized exemplar theory (NGE), in which learning is accomplished by storing objects in Euclidean $n$-space, $E^n$, as hyper-rectangles. The hyper-rectangles may be nested inside one another to arbitrary depth. In contrast to most generalization processes, which replace symbolic formulae by more general formulae, the generalization process for NGE learning modifies hyper-rectangles by growing and reshaping them in a well-defined fashion. The axes of these hyper-rectangles are defined by the variables measured for each example. Each variable can have any range on the real line; thus the theory is not restricted to symbolic or binary values. The basis of this theory is a psychological model called exemplar-based learning, in which examples are stored strictly as points in $E^n$. This thesis describes some advantages and disadvantages of NGE theory, positions it as a form of exemplar-based learning, and compares it to other inductive learning theories.
An implementation has been tested on several different domains, four of which are presented in this thesis: predicting the recurrence of breast cancer, classifying iris flowers, predicting survival times for heart attack patients, and a discrete event simulation of a prediction task. The results in these domains are at least as good as, and in some cases significantly better than, those of other learning algorithms applied to the same data. Exemplar-based learning is emerging as a new direction for machine learning research. The main contribution of this thesis is to show how an exemplar-based theory, using nested generalizations to deal with exceptions, can be used to create very compact representations with excellent modelling capability. %R TR-15-89 %D 1989 %T Finite-State Analysis of Asynchronous Circuits with Bounded Temporal Uncertainty %A Harry R. Lewis %R TR-16-89 %D 1989 %T Flux Tracing: A Flexible Infrastructure for Global Shading %A Jon Christensen %A Joe Marks %A Robert Walsh %A Mark Friedell %X Flux tracing is a flexible, efficient, and easily implemented mechanism for determining scene intervisibility. Flux tracing can be combined with a variety of reflection models in several ways, yielding many different global-shading techniques. Several shading techniques based on flux tracing are illustrated. All provide intensity gradients and shadows with penumbras resulting from area light sources at finite distances. Some provide specular reflection and color bleeding, and some are extremely efficient. Flux tracing is both an expedient means of constructing an efficient global shader and a flexible tool for experimental development of global shading algorithms. %R TR-17-89 %D 1989 %T Efficient Use of Image and Intervisibility Coherence in Rendering and Radiosity Calculations %A Mark Friedell %A Joe Marks %A Robert Walsh %A Jon Christensen %X Rendering algorithms based on image-area sampling were first proposed almost 20 years ago.
In attempting to solve the hidden-surface problem for many pixels simultaneously, area-sampling algorithms are the most aggressive attempts to exploit image coherence. Although area-sampling algorithms are intuitively appealing, they usually do not perform well. When adequate image coherence is present, however, they can perform extremely well for some image regions. We present in this paper two new, hybrid rendering algorithms that combine area sampling with point-sampling techniques. In the presence of significant image coherence, these algorithms are considerably faster than any generally applicable alternative algorithms; for no image are they perceptibly slower. We also show how similar hybrid techniques can exploit intervisibility coherence to efficiently determine form factors for radiosity calculations. %R TR-18-89 %D 1989 %T The Role of User Models in System Design %A Lisa Rubin Neal %X The inadequacy of system designers' models of users has led to difficulties in the use, and even the rejection, of systems. We formulate a multi-dimensional user model, of which two dimensions are related to learning and decision-making styles and additional dimensions are related to the variety of expertise involved in the use of a system. The incorporation of a user model characterizing relevant cognitive variables allows the tailoring of a system to the needs and abilities of its users. We use a number of computer games as rich tasks which allow us to derive information about users. We present the results of our research examining computer games and discuss the implications of this research for system design. %R TR-19-89 %D 1989 %T Fast Fault-Tolerant Parallel Communication with Low Congestion and On-Line Maintenance Using Information Dispersal %A Yuh-Dauh Lyuu %X The space-efficient Information Dispersal Algorithm (IDA) is applied to communication in parallel computers to achieve fast communication, low congestion, and fault tolerance on various networks.
All schemes run {\em within\/} their stated time bounds without long delays. Let $N$ denote the size of the network. In the case of the hypercube, our communication scheme runs in $\,2\cdot\log N+1\,$ time using constant size buffers. Its probability of successful routing is at least $\,1-N^{-2.419\cdot\log N + 1.5}$, proving Rabin's conjecture. The same scheme tolerates $\,N/(12\cdot e\cdot\log N)\,$ random link failures with probability at least $\,1-2\cdot N\cdot(\log N)^{-\log N/12}\,$ ($e=2.718\ldots$). It can also tolerate $\,N/c\,$ random link failures with probability $\,1-O(N^{-1})\,$ for some constant $c$. For a class of $d$-way shuffle networks, our scheme runs in $\,\approx 2\cdot\ln N/\ln\ln N\,$ time using constant size buffers. Its probability of successful routing is at least $\,1-N^{-\ln N/2}$. The same scheme tolerates $\,N/12\,$ random link failures with probability at least $\,1-N^{-\ln\ln\ln N/12}.\,$ For a class of $d$-way digit-exchange networks, our scheme runs in $\,\approx 6\cdot\ln N/\ln\ln N\,$ time using constant size buffers. Its probability of successful routing is at least $\,1-o(N^{-7\cdot\ln N})$. The same scheme tolerates $\,N/(6\cdot e)\,$ random link failures with probability at least $\,1-2\cdot N^{-\ln\ln\ln N/2}$. Another fault model, where links fail independently with a constant probability, is also considered. Numerical calculations show that with practical failure probabilities and sizes of the hypercube, our routing scheme for the hypercube performs well with high probability. On-line and efficient wire testing and replacement on the hypercube can be realized if our fault-tolerant routing scheme is used. Let $\,\alpha\,$ denote the total number of links. It is shown that $\,\approx\alpha/352\,$ wires can be disabled simultaneously without disrupting the ongoing computation or degrading the routing performance much.
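The IDA underlying TR-06-89 and TR-19-89 is Rabin's scheme: data is split into $n$ pieces such that any $m$ of them suffice for reconstruction, with total storage only $n/m$ times the original. A minimal sketch of the idea over a prime field, not the reports' implementation (the systematic encoding, the field size, and all names below are illustrative assumptions): each block of $m$ symbols defines a polynomial of degree $< m$, each piece is one evaluation of that polynomial, and any $m$ evaluations recover the block by interpolation.

```python
P = 2**31 - 1  # prime field GF(P)

def lagrange_eval(pts, x):
    """Value at x of the unique degree < len(pts) polynomial through pts, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(pts):
        num, den = 1, 1
        for j, (xj, _) in enumerate(pts):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def disperse(block, n):
    """Encode a block of m symbols as n pieces; any m pieces suffice."""
    m = len(block)
    pts = list(enumerate(block, start=1))  # data = polynomial values at x = 1..m
    # Piece i is the polynomial's value at the fresh point x = m + i.
    return [(m + i, lagrange_eval(pts, m + i)) for i in range(1, n + 1)]

def retrieve(pieces, m):
    """Recover the m data symbols from any m pieces."""
    return [lagrange_eval(pieces[:m], x) for x in range(1, m + 1)]

block = [10, 20, 30]
pieces = disperse(block, n=5)
assert retrieve(pieces[2:], m=3) == block  # any 3 of the 5 pieces work
```

Each piece carries one symbol per $m$-symbol block, so losing any $n-m$ pieces (failed disks, faulty links) costs nothing, which is the redundancy the routing and RAID applications above exploit.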
%R TR-20-89 %D 1989 %T Lower Bounds on Parallel, Distributed and Automata Computations %A Mih\'{a}ly Ger\'{e}b-Graus %X In this thesis we present a collection of lower bound results: A hierarchy of complexity classes on tree languages (analogous to the polynomial hierarchy) accepted by alternating finite state machines is introduced. By separating the deterministic and the nondeterministic classes of our hierarchy we give a negative answer to the folklore question of whether the expressive power of the tree automaton is the same as that of the finite state automaton that can walk on the edges of the tree (bug automaton). We prove that a three-head one-way DFA cannot perform string-matching; that is, no three-head one-way DFA accepts the language $L=\{x\#y \mid x$ {\rm is a substring of} $y$, {\rm where} $x,y \in \{0,1\}^*\}$. We prove that in a one round fair coin flipping (or voting) scheme with $n$ participants, there is at least one participant who has a chance to decide the outcome with probability at least $3/n-o(1/n)$. We prove an optimal lower bound on the average time required by any algorithm that merges two sorted lists on the parallel comparison tree model. We present a proof of a negative answer to a question raised by Skyum and Valiant, namely, whether the class of symmetric boolean functions has a p-complete family. We give a combinatorial characterization of the concept classes learnable from negative (or positive) examples only in the so-called distribution-free learning model. %R TR-21-89 %D 1989 %T A Data Mapping Parallel Language %A Vinod Kathail %A Dan C. Stefanescu %R TR-22-89 %D 1989 %T An Analysis of the Valiant-Brebner Hypercube\\ Routing Algorithm %A Athanasios Tsantilas %X We present the best known analysis of the running time of the Valiant-Brebner algorithm for routing $h$-relations in the $n$-dimensional hypercube.
Refining an elegant proof technique due to Ranade, we prove that for $h \sim n/\omega(n)$, where $\omega(n)$ tends to infinity arbitrarily slowly, the running time of this algorithm is $2n+O(n/\ln\omega(n))$ with very high probability. The same analysis holds for the directed $n$-butterfly as well. %R TR-23-89 %D 1989 %T Tight Bounds for Oblivious Routing in the Hypercube %A C. Kaklamanis %A D. Krizanc %A A. Tsantilas %X We prove that given an $N$-node communication network with maximum indegree $d$, any deterministic oblivious algorithm for routing an arbitrary permutation requires $\Omega(\sqrt{N}/d)$ time. This is an improvement of a result by Borodin and Hopcroft. For the $N$-node hypercube, in particular, we show a matching upper bound by exhibiting a deterministic oblivious algorithm which routes any permutation in $O(\sqrt{N}/\log N)$ time. The best previously known upper bound was $O(\sqrt{N})$. %R TR-24-89 %D 1989 %T SIMD Algorithms for Image Rendering %A Robert J. Walsh %X A parameterized cost model for SIMD machines is developed and used to rank projective image rendering algorithms. The ranking considers such an extremely broad range of remote communication costs that it holds for all practical situations. Ranked first is a new projective algorithm presented in the thesis. Empirical results support the ranking developed from the analytic models. The cost model is also applied to ray tracing, so that precomputed approximate sorts of surface primitives can be rigorously used to speed parallel ray tracing. The general-purpose architecture assumption is justified by showing that special-purpose machines actually render images more slowly. We conjecture that neither SIMD nor MIMD architectures have a fundamental advantage when efficient algorithms are known for both machine types. The algorithms and analysis presented can serve as a model for SIMD computations in other application domains.
%R TR-01-90 %D 1990 %T Perfect Privacy for Two-Party Protocols %A Donald Beaver %X We examine the problem of computing functions in a distributed manner based on private inputs, revealing the value of a function without revealing any additional information about the inputs. A function $f(x_1,\dots,x_n)$ is $t$-private if there is a protocol whereby $n$ parties, each holding an input value, can compute $f,$ and no subset of $t$ or fewer parties gains any information other than what is computable from $f(x_1,\dots,x_n).$ The class of $t$-private {\em boolean} functions for $t \geq \lceil \frac{n}{2} \rceil$ was described in [CK89]. We give a characterization of 1-private functions for two parties, without the restriction to the boolean case. We also examine a restricted form of private computation in the $n$-party case, and show that addition is the only $t$-privately computable function in that model. Incorrect proofs of this characterization appeared in [Kushilevitz, 1989] and an earlier technical report [Beaver, 1989]. We present a different proof which avoids the errors of those works. This report supersedes Harvard Technical Report TR-11-89. %R TR-02-90 %D 1990 %T Cooperative Dialogues While Playing\\ Adventure %A David Albert %X To collect examples of the use of hedging phrases in informal conversation, three pairs of subjects were (separately) asked to cooperate in playing the computer game {\em Adventure}. While one typed commands into the computer, the two engaged in conversation about their options and best strategy. The conversation was recorded on audio tape, and their computer session was saved in a disk file. Subsequently, the audio tape was transcribed and merged with the computer record of the game to produce a combined transcript showing what was typed along with a simultaneous running commentary by the participants. This report contains the complete transcripts and a discussion of the collection and transcription methodology.
%R TR-03-90 %D 1990 %T M-LISP:\\ A Representation Independent Dialect of LISP with Reduction Semantics %A Robert Muller %X In this paper we propose to reconstruct LISP from first principles, with an eye toward reconciling S-expression LISP's metalinguistic facilities with the kind of operational semantics advocated by Plotkin [pl81]. After reviewing the original definition of LISP we define the abstract syntax and the operational semantics of the new dialect, M-LISP, and show that its equational theory is consistent. Next we develop the operational semantics of an extension of M-LISP which features an explicitly callable {\em eval} and {\em fexprs} (i.e., procedures whose arguments are passed {\em by-representation}). Since M-LISP is independent of any representation of its programs, it has no {\em quotation} operator or any of its related forms. To compensate for this we encapsulate the shifting between mention and use, which is performed globally by {\em quote}, within the metalinguistic constructs that require it. The resulting equational system is shown to be inconsistent. We leave it as an open problem to find confluent variants of the metalinguistic constructs. %R TR-04-90 %D 1990 %T Syntax Macros in M-LISP:\\ A Representation Independent Dialect of LISP with Reduction Semantics %A Robert Muller %X In this paper we consider syntactic abstraction in M-LISP, a dialect of LISP which is independent of any representation of its programs. Since it is independent of McCarthy's original Meta-expression representation scheme, M-LISP has no {\em quotation} form or any of its related forms {\em backquote}, {\em unquote}, or {\em unquote-splicing}. Given that LISP macro systems depend on this latter representation model, it is not obvious how to support syntax extensions in M-LISP. Our approach is based on an adaptation of the {\em Macro-by-Example} [kowa87] and {\em Hygienic} [kofrfedu86] algorithms.
The adaptation to the quotation-free syntactic structure of M-LISP yields a substantially different model of syntax macros. The most important difference is that $\lambda$ binding patterns become apparent when an abstraction is first (i.e., partially) transcribed in the syntax tree. This allows us to define tighter restrictions on the capture of identifiers. This is not possible in S-expression dialects such as Scheme, since $\lambda$ binding patterns are not apparent until the tree is completely transcribed. %R TR-05-90 %D 1990 %T Semantics Prototyping in M-LISP\\ (Extended Abstract) %A Robert Muller %X In this paper we describe a new semantic metalanguage which simplifies the prototyping of programming languages. The system integrates Paulson's semantic grammars within a new dialect of LISP, M-LISP, which has somewhat closer connections to the $\lambda$-calculus than other LISP dialects such as Scheme. The semantic grammars are expressed as attribute grammars. The generated parsers are M-LISP functions that can return denotational (i.e., higher-order) representations of abstract syntax. We illustrate the system with several examples and compare it to related systems. %R TR-06-90 %D 1990 %T ESPRIT\\ Executable Specification of Parallel Real-time Interactive Tasks %A Azer Bestavros %X The vital role that embedded systems are playing and will continue to play in our world, coupled with their increasingly complex and critical nature, demands a rigorous and systematic treatment that recognizes their unique requirements. The Time-constrained Reactive Automaton (TRA) is a formal model of computation that admits these requirements. Using the TRA model, an embedded system is viewed as a set of {\em asynchronously} interacting automata (TRAs), each representing an {\em autonomous} system entity. TRAs are {\em input enabled}; they communicate by signaling events on their {\em output channels} and by reacting to events signaled on their {\em input channels}.
The behavior of a TRA is governed by {\em time-constrained causal relationships} between {\em computation-triggering} events. The TRA model is {\em compositional} and allows time, control, and computation {\em non-determinism}. In this paper we present ESPRIT, a specification language that is entirely based on the TRA model. We have developed a compiler that allows ESPRIT specifications to be executed in simulated time, thus providing a valuable validation tool for embedded system specifications. We are currently developing another compiler that would allow the execution of ESPRIT specifications in real time, thus making it possible to write real-time programs directly in ESPRIT. %R TR-07-90 %D 1990 %T A Logic of Concrete Time Intervals\\ (Extended Abstract) %A Harry R. Lewis %X This paper describes (1) a finite-state model for asynchronous systems in which the time delays between the scheduling and occurrence of the events that cause state changes are constrained to fall between fixed numerical upper and lower time bounds; (2) a branching-time temporal logic suitable for describing the temporal and logical properties of asynchronous systems, for which the structures of (1) are the natural models; and (3) a functional verification system for asynchronous circuits, which generates, from a boolean circuit with general feedback and specified min/max rise and fall times for the gates, a finite-state structure as in (1), and then exhaustively checks a formal specification of that circuit in the language of (2) against that finite-state model. %R TR-08-90 %D 1990 %T Generating Descriptions that Exploit a User's Domain\\ Knowledge %A Ehud Reiter %X Natural language generation systems should customize object descriptions according to the extent of their user's domain and lexical knowledge.
The task of generating customized descriptions is formalized as a task of finding descriptions that are {\it accurate} (truthful), {\it valid} (fulfill the speaker's communicative goal), and {\it free of false implicatures} (do not give rise to unintended conversational implicatures) with respect to the current user model. An algorithm that generates descriptions that meet these constraints is described, and the computational complexity of the generation problem is discussed. %R TR-09-90 %D 1990 %T The Computational Complexity of Avoiding Conversational\\ Implicatures %A Ehud Reiter %X Referring expressions and other object descriptions should be maximal under the Local Brevity, No Unnecessary Components, and Lexical Preference preference rules; otherwise, they may lead hearers to infer unwanted conversational implicatures. These preference rules can be incorporated into a polynomial time generation algorithm, while some alternative formalizations of conversational implicature make the generation task NP-Hard. %R TR-10-90 %D 1990 %T Generating Appropriate Natural Language Object Descriptions %A Ehud Baruch Reiter %X Natural language generation (NLG) systems must produce different utterances for users with different amounts of domain and lexical knowledge. An utterance that is meant to be read by an expert should use technical vocabulary, and avoid explicitly mentioning facts the expert can immediately infer from the rest of the utterance. In contrast, an utterance that is meant to be read by a novice should avoid specialized vocabulary, and may be required to explicitly mention facts that would be obvious to an expert. An NLG system that does not customize utterances according to its user's domain and lexical knowledge may generate text that is incomprehensible to a novice, or text that leads an expert to infer unwanted {\it conversational implicatures} (Grice 1975).
This thesis examines the problem of generating attributive descriptions of individuals, that is, object descriptions that are intended to inform the user that a particular object has certain attributes. It proposes that such descriptions will be appropriate for a particular user if they are {\it accurate, valid}, and {\it free of false implicatures} with respect to a user-model that represents that user's relevant domain and lexical knowledge. Descriptions are represented as definitions of KL-ONE-type (Brachman and Schmolze 1985) classes, and a description is called {\it accurate} if it defines a class that subsumes the object being described; {\it valid} if every attribute the system wishes to communicate is either part of the description or a default attribute that is inherited by the class defined by the description; and {\it free of false implicatures} if it is maximal under three preference rules: No Unnecessary Components, Local Brevity, and Lexical Preference. %R TR-11-90 %D 1990 %T Domain Theory for Nonmonotonic Functions %A Yuli Zhou %A Robert Muller %X We prove several lattice-theoretical fixpoint theorems based on the classical result of Knaster-Tarski. These theorems give sufficient conditions for a system of generally nonmonotonic functions on a complete lattice to define a unique fixpoint. The primary objective of this paper is to develop a domain-theoretic framework to study the semantics of general logic programs as well as various rule-based systems where the rules define nonmonotonic functions on lattices. %R TR-12-90 %D 1990 %T A Syntax and Semantics for Network Diagrams %A Joe Marks %X The ability to automatically design graphical displays of data will be important for the next generation of interactive computer systems. The research reported here concerns the automated design of network diagrams, one of the three main classes of symbolic graphical display (the other two being chart graphs and maps).
Previous notions of syntax and semantics for network diagrams are not adequate for automating the design of this kind of graphical display. I present here a new formulation of syntax and semantics for network diagrams that is used in the ANDD (Automated Network Diagram Designer) system. The syntactic formulation differs from previous work in two significant ways: perceptual-organization phenomena are explicitly represented, and syntax is described in terms of constraints rather than as a grammar of term-rewriting rules. The semantic formulation is based on an application-independent model of network systems that can be used to model many real-world applications. The paper includes examples that show how these concepts are used by ANDD to automatically design network diagrams. %R TR-13-90 %D 1990 %T Avoiding Unwanted Conversational Implicatures in Text and Graphics %A Joe Marks %A Ehud Reiter %X We have developed two systems, FN and ANDD, that use natural language and graphical displays, respectively, to communicate information about objects to human users. Both systems must deal with the fundamental problem of ensuring that their output does not carry unwanted and inappropriate conversational implicatures. We describe the types of conversational implicatures that FN and ANDD can avoid, and the computational strategies the two systems use to generate output that is free of unwanted implicatures. %R TR-14-90 %D 1990 %T Models of Plans to Support Communications:\\ An Initial Report %A Karen E. Lochbaum %A Barbara J. Grosz %A Candace L. Sidner %X Agents collaborating to achieve a goal bring to their joint activity different beliefs about ways in which to achieve the goal and the actions necessary for doing so. Thus, a model of collaboration must provide a way of representing and distinguishing among agents' beliefs and of stating the ways in which the intentions of different agents contribute to achieving their goal.
Furthermore, in collaborative activity, collaboration occurs in the planning process itself. Thus, rather than modelling plan recognition, per se, what must be modelled is the {\it augmentation} of beliefs about the actions of multiple agents and their intentions. In this paper, we modify and expand the SharedPlan model of collaborative behavior (Grosz \& Sidner 1990). We present an algorithm for updating an agent's beliefs about a partial SharedPlan and describe an initial implementation of this algorithm in the domain of network management. %R TR-15-90 %D 1990 %T An Information Dispersal Approach to Issues in Parallel Processing %A Yuh-Dauh Lyuu %X Efficient schemes for the following issues in parallel processing are presented: fast communication, low congestion, fault tolerance, simulation of ideal parallel computation models, synchronization in asynchronous networks, low sensitivity to variations in component speed, and on-line maintenance. All our schemes employ Rabin's information dispersal idea. We also develop an efficient information dispersal algorithm (IDA) based on the Fast Fourier Transform and an IDA-based voting scheme to enforce fault tolerance. Let $N$ denote the size of the hypercube network. We present a randomized communication scheme, FSRA (for ``Fault-tolerant Subcube Routing Algorithm''), that routes in $2 \cdot \log N + 1$ time using only constant size buffers and with probability of success $1 - N^ {-\Theta (\log N)} $. (All log's are to the base 2.) FSRA also tolerates $O(N)$ random link failures with high probability. Similar results are also obtained for the de Bruijn and the butterfly networks (without fault tolerance in the latter case). FSRA is employed to simulate, without using hashing, a class of CRCW PRAM (concurrent-read concurrent-write parallel random access machine) programs with a slowdown of $O(\log N)$ with almost certainty if combining is used. 
A fault-tolerant simulation scheme for general CRCW PRAM programs is also presented. A simple acknowledgement synchronizer can make all our routing schemes in this dissertation run on asynchronous networks without loss of efficiency. We further show that the speed of any component--be it a processor or a link--has only linear impact on the run-time of FSRA; that is, the extra delay in run-time is only proportional to the drift in the component's delay and is independent of the size of the network. On-line maintainability makes the machine more available to the user. We show that, under FSRA, a constant fraction of the links can be disabled with essentially no impact on the routing performance. This result immediately suggests several efficient maintenance procedures. Based on the above results, a fault-tolerant parallel computing system, called HPC (for ``hypercube parallel computer''), is sketched at the end of this dissertation. %R TR-16-90 %D 1990 %T Identifying $\mu$-Formula Decision Trees with Queries %A Thomas R. Hancock %X We consider a learning problem for the representation class of $\mu$-formula decision trees, a generalization of $\mu$-formulas and $\mu$-decision trees. (The ``$\mu$'' form of a representation has the restriction that no variable appears more than once.) The learning model is one of exact identification by oracle queries, where the learner's goal is to discover an unknown function by asking membership queries (is the function true on some specified input?) and equivalence queries (is the function identical to some hypothesis we present, and if not what is an input on which they differ?). We present an identification algorithm using these two types of queries that runs in time polynomial in the number of variables, and we show that no such polynomial time algorithm exists that uses either membership or equivalence queries alone (in the latter case under the stipulation that the hypotheses are drawn from the same representation class).
We further extend the algorithm to identify a broader class where the formulas are taken over a more powerful basis including arbitrary threshold gates. %R TR-17-90 %D 1990 %T Design and Modeling with Schema Grammars %A Mark Friedell %A Sandeep Kochhar %X Graphical scene modeling is usually a time-consuming, tedious, and expensive manual activity. This paper proposes an approach to partially automating the process through a paradigm for cooperative user-computer scene modeling that we call {\em cooperative computer-aided design} (CCAD). Formal grammars, referred to as {\em schema grammars}, are used to imbue the modeling system with an elementary ``understanding'' of the kinds of scenes to be created. The grammar interpreter constructs part or all of the scene after accepting from the user partially completed scene components and descriptions of scene properties that must be incorporated into the final scene. This approach to modeling harnesses the power of the computer to construct scene detail, thereby freeing the human user to focus on essential creative decisions. This paper describes the structure and interpretation of schema grammars, and provides techniques for controlling the combinatorial explosion that can result from the undirected interpretation of the grammars. CCAD is explored in the context of two experimental systems---FLATS, for architectural design, and LG, for modeling landscapes. %R TR-18-90 %D 1990 %T Cooperative Computer-Aided Design: A Paradigm for Automating the Design and Modeling of Graphical Objects %A Sandeep Kochhar %X Design activity is often characterized by a search in which the designer examines various alternatives at several stages during the design process. Current computer-aided design (CAD) systems, however, provide very little support for this exploratory aspect of design.
My research provides the foundation for {\em cooperative computer-aided design (CCAD)}---a novel CAD technology that intersperses partial exploration of design alternatives by the computer with guiding design operations by the system user. CCAD combines the strengths of manual and automated design and modeling by allowing the user to make creative decisions and provide specialized detail, while exploiting the power of the machine to explore many design alternatives and create detail that is resonant with the user's design decisions. In the CCAD paradigm, the user expresses initial design decisions in the form of a partial design and a set of properties that the final design must satisfy. The user then initiates the generation by the system of alternative {\em partial} developments of the initial design subject to a ``language'' of valid designs. The results are then structured in a spatial framework through which the user moves to explore the alternatives. The user selects the most promising partial design, refines it manually, and then requests further automatic development. This process continues until a satisfactory complete design is created. I present the interpretation of schema grammars---a class of generative grammars for manipulating graphical objects---as the fundamental generative mechanism underlying CCAD. I describe in detail several mechanisms for providing user control over the generative process, thereby controlling the combinatorial explosion inherent in an unrestricted, undirected interpretation of generative grammars. I explore graphical browsing as a facility for efficiently perusing the set of design alternatives generated by the system. I also describe FLATS ({\bf F}loor plan {\bf LA}you{\bf T} {\bf S}ystem)---a prototype CCAD system for the design of small architectural floor plans. %R TR-19-90 %D 1990 %T Understanding Subsumption and Taxonomy:\\ A Framework for Progress %A William A.
Woods %X This paper continues a theme begun in my paper, ``What's in a Link,'' -- seeking a solid foundation for network representations of knowledge. Like its predecessor, its goal is to clarify issues and establish a framework for progress. The paper analyzes the concepts of subsumption and taxonomy and synthesizes a framework that integrates and clarifies many previous approaches and goes beyond them to provide an account of abstract and partially defined concepts. The distinction between definition and assertion is reinterpreted in a framework that accommodates probabilistic and default rules as well as universal claims and abstract and partial definitions. Conceptual taxonomies in this framework are shown to be useful for indexing and organizing information and for managing the resolution of conflicting defaults. The paper introduces a distinction between intensional and extensional subsumption and argues for the importance of the former. It presents a classification algorithm based on intensional subsumption and shows that its typical case complexity is logarithmic in the size of the knowledge base. %R TR-20-90 %D 1990 %T The KL-ONE Family %A William A. Woods %A James G. Schmolze %X The knowledge representation system KL-ONE has been one of the most influential and imitated knowledge representation systems in the Artificial Intelligence community. Begun at Bolt Beranek and Newman in 1978, KL-ONE pioneered the development of taxonomic representations that can automatically classify and assimilate new concepts based on a criterion of terminological subsumption. This theme generated considerable interest in both the formal community and a large community of potential users. The KL-ONE community has since expanded to include many systems at many institutions and in many different countries. This paper introduces the KL-ONE family and discusses some of the main themes explored by KL-ONE and its successors. 
We give an overview of current research, describe some of the systems that have been developed, and outline some future research directions. %R TR-21-90 %D 1990 %T Efficiency of Semi-Synchronous versus Asynchronous Networks %A Hagit Attiya %A Marios Mavronicolas %X The $s$-session problem is studied in {\em asynchronous} and {\em semi-synchronous} networks. Processes are located at the nodes of an undirected graph $G$ and communicate by sending messages along links that correspond to the edges of $G$. A session is a part of an execution in which each process takes at least one step; an algorithm for the $s$-session problem guarantees the existence of at least $s$ disjoint sessions. The existence of many sessions guarantees a degree of interleaving which is necessary for certain computations. It is assumed that the (real) time for message delivery is at most $d$. In the asynchronous model, it is assumed that the time between any two consecutive steps of any process is in the interval $[0,1]$; in the semi-synchronous model, the time between any two consecutive steps of any process is in the interval $[c,1]$ for some $c$ such that $0 < c \leq 1$, the {\em synchronous} model being the special case where $c=1$. In the {\em initialized} case of the problem, all processes are initially synchronized and take a step at time 0. For the asynchronous model, an upper bound of $diam(G)(d+1)(s-1)$ and a lower bound of $diam(G)d(s-1)$ are presented; $diam(G)$ is the {\em diameter} of $G$. For the semi-synchronous model, an upper bound of $1+\min\{\lfloor \frac{1}{c} \rfloor +1, diam(G)(d+1)\}(s-2)$ is presented. The main result of the paper is a lower bound of $1+\min\{\lfloor \frac{1}{2c} \rfloor, diam(G)d\}(s-2)$ for the time complexity of any semi-synchronous algorithm for the $s$-session problem, under the assumption that $d \geq \frac{d}{\min\{\lfloor\frac{1}{2c}\rfloor, diam(G)d\}} + 2$.
These results imply a time separation between semi-synchronous (in particular, synchronous) and asynchronous networks. Similar results are proved for the case where delays are not uniform. In the {\em uninitialized} case of the problem, all processes but one, the {\em initiator}, start in a {\em quiescent} state which they may leave upon receiving a message. Similar results are proved for this case. %R TR-22-90 %D 1990 %T Efficient execution of homogeneous tasks with unequal run times on the Connection Machine %A Azer Bestavros %A Thomas Cheatham %X Many scientific applications require the execution of a large number of identical {\em tasks}, each on a different set of data. Such applications can easily benefit from the power of SIMD architectures ({\em e.g.} the Connection Machine) by having the array of processing elements (PEs) execute the task in parallel on the different data sets. It is often the case, however, that the task to be performed involves the repetitive application of the same sequence of steps, {\em a body}, for a number of times that depends on the input or computed data. If the usual {\em task-level synchronization} is used, the utilization of the array of PEs degrades substantially. In this paper, we propose a {\em body-level synchronization} scheme that would boost the utilization of the array of PEs while keeping the required overhead to a minimum. We mathematically analyze the proposed technique and show how to optimize its performance for a given application. Our technique is especially efficient when the number of tasks to be executed is much larger than the number of physical PEs available. %R TR-23-90 %D 1990 %T Modelling Act-Type Relations in Collaborative Activity %A Cecile T. Balkanski %X Intelligent agents collaborating to achieve a common goal direct a significant portion of their effort to planning together.
Because having a plan to perform a given task involves having knowledge about the ways in which the performance of a set of actions will lead to the performance of that task, the representation of actions and of relations among them is a central concern. This paper provides a formalism for representing act-type relations and complex act-type constructors in multiagent domains. To determine a set of relations that would span the space of complex collaborative actions, I analyzed a videotape of two people building a piece of furniture, and identified a group of relations that are adequate to represent the relationships among the actions occurring in the data. The definitions provided here are more complex than those used in earlier studies; in particular, I refine and expand Pollack's (1986) set of act-type relations (generation and enablement), provide constructors for building complex act-types from (simpler) act-types (simultaneity, conjunction, sequence and iteration), and make it possible to represent the joint actions of multiple agents. %R TR-24-90 %D 1990 %T Security, Fault Tolerance, and Communication Complexity in Distributed Systems %A Donald Rozinak Beaver %X We present efficient and practical algorithms for a large, distributed system of processors to achieve reliable computations in a secure manner. Specifically, we address the problem of computing a general function of several private inputs distributed among the processors of a network, while ensuring the correctness of the results and the privacy of the inputs, despite accidental or malicious faults in the system. Communication is often the most significant bottleneck in distributed computing. Our algorithms maintain a low cost in local processing time, are the first to achieve optimal levels of fault-tolerance, and most importantly, have low communication complexity. 
In contrast to the best known previous methods, which require large numbers of rounds even for fairly simple computations, we devise protocols that use small messages and a constant number of rounds {\em regardless} of the complexity of the function to be computed. Through direct algebraic approaches, we separate the {\em communication complexity of secure computing} from the {\em computational complexity} of the function to be computed. We examine security under both the modern approach of computational complexity-based cryptography and the classical approach of unconditional, information-theoretic security. We develop a clear and concise set of definitions that support formal proofs of claims to security, addressing an important deficiency in the literature. Our protocols are provably secure. In the realm of information-theoretic security, we characterize those functions which two parties can compute jointly with absolute privacy. We also characterize those functions which a weak processor can compute using the aid of powerful processors without having to reveal the instances of the problem it would like to solve. Our methods include a promising new technique called a {\em locally random reduction} , which has given rise not only to efficient solutions for many of the problems considered in this work but to several powerful new results in complexity theory. %R TR-25-90 %D 1990 %T Communication Issues in Parallel Computation %A Athanasios M. Tsantilas %X This thesis examines the problem of interprocessor communication in realistic parallel computers. In particular, we consider the problem of permutation routing and its generalizations in the mesh, hypercube and butterfly networks. Building on previous research, we derive lower bounds for a wide class of deterministic routing algorithms which imply that such algorithms create heavy traffic congestion. 
In contrast, we show that randomized routing algorithms result in both efficient and optimal upper bounds in the above networks. Experiments were also performed to test the behaviour of the randomized algorithms. These experiments suggest interesting theoretical problems. We also examine the problem of efficient interprocessor communication in a model suggested by recent advances in optical computing. The main argument of this thesis is that communication can be made efficient if randomization is used in the routing algorithms. %R TR-01-91 %D 1991 %T QCD on the Connection Machine: Beyond $^*\hbox{LISP}$ %A Ralph G. Brickner %A Clive F. Baillie %A S. Lennart Johnsson %X We report on the status of code development for a simulation of Quantum Chromodynamics (QCD) with dynamical Wilson fermions on the Connection Machine model CM-2. Our original code, written in \*Lisp, gave performance in the near-GFLOPS range. We have rewritten the most time-consuming parts of the code in the low-level programming system CMIS, including the matrix multiply and the communication. Current versions of the code run at approximately 3.6 GFLOPS for the fermion matrix inversion, and we expect the next version to reach or exceed 5 GFLOPS. %X tr-01-91.ps.gz %R TR-02-91 %D 1991 %T Communication and I/O Libraries %A S. Lennart Johnsson %A Patrick Worley %X tr-02-91.ps.gz %R TR-03-91 %D 1991 %T Optimal Communication Channel Utilization for Matrix Transposition and Related Permutations on Boolean Cubes %A S. Lennart Johnsson %A Ching-Tien Ho %X TR-16-92 SUPERSEDES TR-03-91 %R TR-04-91 %D 1991 %T Generalized Shuffle Permutations on Boolean Cubes %A S. Lennart Johnsson %A Ching-Tien Ho %X In a {\em generalized shuffle permutation} an address $(a_{q-1}a_{q-2} \ldots a_{0})$ receives its content from an address obtained through a cyclic shift on a subset of the $q$ dimensions used for the encoding of the addresses. Bit-complementation may be combined with the shift.
We give an algorithm that requires $\frac{K}{2} + 2$ exchanges for $K$ elements per processor, when storage dimensions are part of the permutation, and concurrent communication on all ports of every processor is possible. The number of element exchanges in sequence is independent of the number of processor dimensions $\sigma_{r}$ in the permutation. With no storage dimensions in the permutation our best algorithm requires $(\sigma_{r} + 1) \lceil \frac{K}{2 \sigma_{r}} \rceil$ element exchanges. We also give an algorithm for $\sigma_{r} = 2$, or when the real shuffle consists of a number of cycles of length two, that requires $\frac{K}{2} + 1$ element exchanges in sequence when there is no bit complementation. The lower bound is $\frac{K}{2}$ for both real and mixed shuffles with no bit complementation. The minimum number of communication start-ups is $\sigma_{r}$ for both cases, which is also the lower bound. The data transfer time for communication restricted to one port per processor is $\sigma_{r} \frac{K}{2}$, and the minimum number of start-ups is $\sigma_{r}$. The analysis is verified by experimental results on the Intel iPSC/1, and for one case also on the Connection Machine. %X tr-04-91.ps.gz %R TR-05-91 %D 1991 %T The Computational Complexity of Cartographic Label\\ Placement %A Joe Marks %A Stuart Shieber %X We examine the computational complexity of cartographic label placement, a problem derived from the cartographer's task of placing text labels adjacent to map features in such a way as to minimize overlaps with other labels and map features. Cartographic label placement is one of the most time-consuming tasks in the production of maps.
Consequently, several attempts have been made to automate the label-placement task for some or all classes of cartographic features (punctual, linear, or areal features), but all previously published algorithms for the most basic task---point-feature-label placement---either exhibit worst-case exponential time complexity, or incorporate incomplete heuristics that may fail to find an admissible labeling even when one exists. The computational complexity of label placement is therefore a matter of practical significance in automated cartography. We show that admissible label placement is NP-complete, even for very simple versions of the problem. Thus, no polynomial time algorithm exists unless $P=NP$. Similarly, we show that optimal label placement can be solved in polynomial time if and only if $P=NP$, and this result holds even if we require only approximately optimal placements. The results are especially interesting because cartographic label placement is one of the few combinatorial problems that remains NP-hard even under a geometric (Euclidean) interpretation. The results are of broader practical significance, as they also apply to point-feature labeling in non-cartographic displays, e.g., the labeling of points in a scatter plot. %X tr-05-91.ps.gz %R TR-06-91 %D 1991 %T A Graphical Editor for Three-Dimensional Constraint-Based\\ Geometric Modeling %A Steven John Sistare %X The design of geometric models can be a painstaking and time-consuming task. Typical CAD packages are often primitive or deficient in the means they offer for placing geometry in a design or for subsequently modifying the geometry. Modification in particular can tax a user's patience when it requires many deletion, creation, or perturbation operations to effect a conceptually simple change in the design. One area of research that attempts to address these deficiencies involves the use of constraints on the geometry as a means of both specifying and controlling its shape.
The form in which constraint information is elicited from the user determines the ease of use of any geometric system that is based on constraints, and is one of the basic problems to be addressed in the design of such a system. I present a constraint-based geometric editor that allows the manipulation of both constraints and geometry using the direct-manipulation paradigm, which is well established as being of central importance in many easy-to-use systems. When using the editor, constraints are presented to the user graphically, in the context of the geometric design, and may be created, destroyed, and manipulated interactively along with the geometry. The constraints may either be created explicitly, or implicitly as a side effect of creating geometry. In addition, constraints are used in a novel way to facilitate interactive creation and positioning of geometry in three-space, despite the limitations of commonly-available two-dimensional display and input devices. Lastly, whenever geometry is modified using direct manipulation, a solver is called which updates the geometry in accordance with the existing constraints. All of these features contribute to the ease of use of my system. I also present a solver that addresses another basic problem inherent in constraint-based systems; namely, the need to efficiently obtain a solution that instantiates the geometry so as to satisfy the constraints. I provide a robust and efficient solver that is $O(n^2)$ in the size of the geometric problem being solved and present the mathematics that support it. In addition, I present a new algorithm that partitions the free-form geometry and constraint network into a number of pieces that may be independently solved. For many networks, this algorithm yields a solution to the entire network in close to linear time.
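The partitioning step described in the abstract above (TR-06-91), splitting the constraint network into pieces that may be independently solved, can be approximated by a connected-components computation over the constraint graph. The following is a minimal illustrative sketch under that simplifying assumption; the function and variable names are hypothetical, and the report's actual algorithm also handles the free-form geometry:

```python
from collections import defaultdict

def partition_constraints(variables, constraints):
    """Group geometric variables into independently solvable pieces.

    Two variables belong to the same piece when a chain of constraints
    links them; pieces that share no constraints can be handed to the
    solver separately.  (Illustrative sketch only.)
    """
    adj = defaultdict(set)
    for a, b in constraints:          # each constraint couples two variables
        adj[a].add(b)
        adj[b].add(a)
    seen, pieces = set(), []
    for v in variables:
        if v in seen:
            continue
        stack, piece = [v], []
        seen.add(v)
        while stack:                  # depth-first search over the network
            u = stack.pop()
            piece.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        pieces.append(sorted(piece))
    return pieces
```

For example, four points coupled by two disjoint constraints fall into two pieces, each of which a solver could process independently, which is where the near-linear behavior on many networks comes from.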
%R TR-07-91 %D 1991 %T Tight Upper and Lower Bounds on the Path Length\\ of Binary Trees %A Alfredo De Santis %A Giuseppe Persiano %X The {\em external path length} of a tree $T$ is the sum of the lengths of the paths from the root to each external node. The {\em maximal path length difference,} $\triangle$, is the difference between the length of the longest and shortest such path. We prove tight lower and upper bounds on the external path length of binary trees with $N$ external nodes and prescribed maximal path length difference $\triangle$. In particular, we give an upper bound that, for each value of $\triangle$, can be exactly achieved for infinitely many values of $N$. This improves on the previously known upper bound that could only be achieved up to a factor proportional to $N$. We then use the upper bound to give a simple upper bound on the path length of Red-Black trees which is asymptotically tight. We also present, as a preliminary result, an elementary proof of the known upper bound. We finally prove a lower bound which can be exactly achieved for each value of $N$ and $\triangle \leq N/2$. %R TR-08-91 %D 1991 %T Abstract Semantics of First-Order Recursive Schemes %A Robert Muller %A Yuli Zhou %X We develop a general framework for deriving abstract domains from concrete semantic domains in the context of first-order recursive schemes and prove several theorems which ensure the correctness (safety) of abstract computations. The abstract domains, which we call {\em Weak Hoare powerdomains}, subsume the roles of both the abstract domains and the collecting interpretations in the abstract interpretation literature. %R TR-09-91 %D 1991 %T Semantic Domains for Abstract Interpretation %A Robert Muller %A Yuli Zhou %X In this paper we consider abstract interpretation of PCF programs. The main development is the extension of {\em weak powerdomains} to higher types.
In the classical abstract interpretation approach, abstract domains are constructed explicitly and the abstract semantics is then related to the concrete semantics. In the approach introduced here, abstract domains are {\em derived} directly from concrete domains. The conditions for deriving the domains are intended to be as general as possible while still guaranteeing that the derived domain has sufficient structure so that it can be used as a basis for computing correct information about the concrete semantics. We prove three main theorems, the last of which ensures the correctness of abstract interpretation of PCF programs given safe interpretations of the constants. This generalizes earlier results obtained for the special case of strictness analysis. %R TR-10-91 %D 1991 %T Performance Modeling of Distributed Memory Architectures %A S. Lennart Johnsson %X We provide performance models for several primitive operations on data structures distributed over memory units interconnected by a Boolean cube network. In particular, we model single-source and multiple-source concurrent broadcasting or reduction, concurrent gather and scatter operations, shifts along several axes of multi-dimensional arrays, and emulation of butterfly networks. We also show how the processor configuration, data aggregation, and the encoding of the address space affect the performance for two important basic computations: the multiplication of arbitrarily shaped matrices, and the Fast Fourier Transform. We also give an example of the performance behavior for local matrix operations for a processor with a single path to local memory, and a set of registers. The analytic models are verified by measurements on the Connection Machine model CM-2. %X tr-10-91.ps.gz %R TR-11-91 %D 1991 %T Outerjoins---How to Extend a Conventional Optimizer %A C\'{e}sar Galindo-Legaria %A Arnon Rosenthal %X Free choice among join orderings is one of the most powerful optimizations in a conventional optimizer.
But the freedom is limited to Select/Project/Join queries. In this paper, we extend this freedom to queries that include outerjoins. Unlike previous work, these results are not limited to queries possessing a ``nice structure,'' or queries that are nicely represented in relational calculus. Our theoretical results concern query ``simplification'' and reassociation using a generalized outerjoin. We show how the necessary computation can be added rather easily to the join-order generation of a conventional query optimizer. %R TR-12-91 %D 1991 %T An Algorithm for Plan Recognition in Collaborative Discourse %A Karen E. Lochbaum %X A model of plan recognition in discourse must be based on intended recognition, distinguish each agent's beliefs and intentions from the other's, and avoid assumptions about the correctness or completeness of the agents' beliefs. In this paper, we present an algorithm for plan recognition that is based on the SharedPlan model of collaboration and that satisfies these constraints. %R TR-13-91 %D 1991 %T Object-Oriented Programming for Massively Parallel Machines %A Michael F. Kilian %X Large, robust massively parallel programs that are understandable (and therefore maintainable) are not yet a reality. Such programs require a programming methodology that minimizes the conceptual differences between the program and the domain addressed by the program, encourages reusability, and still produces robust programs that are readily maintained and reasoned about. This paper proposes the parallel object-oriented model. The model is constructed from an object-oriented methodology augmented by constructs and semantics for parallel processing, and satisfies the requirements for building large parallel applications. It presents a unique way of representing object references and of managing concurrent access to objects. The methodology may be extended for a wide range of computing platforms and application areas. 
%R TR-14-91 %D 1991 %T Plan Recognition in Collaborative Discourse %A Karen E. Lochbaum %X A model of plan recognition in discourse must be based on intended recognition, distinguish each agent's beliefs and intentions from the other's, and avoid assumptions about the correctness or completeness of the agents' beliefs. In this paper, we present an algorithm for plan recognition that is based on the SharedPlan model of collaboration and that satisfies these constraints. %R TR-15-91 %D 1991 %T Elliptic Curves in Computer Science:\\ Primality Testing, Factoring, and Cryptography %A Michael Mitzenmacher %R TR-16-91 %D 1991 %T Identifiability is Closed under Embeddings in Read-Once\\ Formulas or $\mu$-Decision Trees %A Thomas R. Hancock %X We show a general positive result that allows us to boost the expressiveness of projection closed classes of boolean functions that are identifiable with membership and equivalence queries. Such classes include monotone DNF formulas, read-once formulas, conjunctions of Horn clauses, switch configurations, $\mu$-formula decision trees, and read-twice DNF formulas. We show that when representations from such classes (rather than single literals) are tested at the leaves of a read-once formula, or on the internal nodes of a $\mu$-decision tree, the resulting representation class is still identifiable with membership and equivalence queries. The additional overhead in time and queries is polynomial. %R TR-17-91 %D 1991 %T Computational Complexity of a Problem in Molecular Structure Prediction %A J. Thomas Ngo %A Joe Marks %X The computational task of protein-structure prediction is believed to require exponential time, but previous arguments as to its intractability have taken into account only the size of a protein's conformational space. Such arguments do not rule out the possible existence of an algorithm, more selective than exhaustive search, that is efficient and exact. 
(An {\em efficient} algorithm is one that is guaranteed, for all possible inputs, to run in time bounded by a function polynomial in the problem size. An {\em intractable} problem is one for which no efficient algorithm exists.) Questions regarding the possible intractability of problems are often best answered using the theory of NP-completeness. In this treatment we show the NP-hardness of two typical mathematical statements of empirical potential energy function minimization for macromolecules. Unless all NP-complete problems can be solved efficiently, these results imply that a function-minimization algorithm can be efficient for protein-structure prediction only if it exploits protein-specific properties that prohibit the simple geometric constructions that we use in our proofs. Analysis of further mathematical statements of molecular structure prediction could constitute a systematic methodology for identifying sources of complexity in protein folding, and for guiding development of predictive algorithms. %X tr-17-91.ps.gz %R TR-18-91 %D 1991 %T Optimal All-to-All Personalized Communication with Minimum Span on Boolean Cubes %A S. Lennart Johnsson %A Ching-Tien Ho %X All-to-all personalized communication is a class of permutations in which each processor sends a unique message to every other processor. We present optimal algorithms for concurrent communication on all channels in Boolean cube networks, both for the case with a single permutation, and the case where multiple permutations shall be performed on the same local data set, but on different sets of processors. For \ $ K $ \ elements per processor our algorithms give the optimal number of element transfers, \ $ K/2.
$ \ For a succession of all-to-all personalized communications on disjoint subcubes of \ $ \beta $ \ dimensions each, our best algorithm yields \ $ \frac {K} {2} + \sigma - \beta $ \ element exchanges in sequence, where \ $ \sigma $ \ is the total number of processor dimensions in the permutation. An implementation on the Connection Machine of one of the algorithms offers a maximum speed-up of 50\% compared to the previously best known algorithm. %X tr-18-91.ps.gz %R TR-19-91 %D 1991 %T Matrix Multiplication on Hypercubes Using Full Bandwidth and Constant Storage %A Ching-Tien Ho %A S. Lennart Johnsson %A Alan Edelman %X For matrix multiplication on hypercube multiprocessors with the product matrix accumulated in place a processor must receive about \ $ P^ {2} /\sqrt {N} $ \ elements for each input operand, with operands of size \ $ P \times P $ \ distributed evenly over \ $ N $ \ processors. With concurrent communication on all ports, the number of element transfers in sequence can be reduced to \ $ P^ {2} /(\sqrt {N} \log N) $ \ for each input operand. We present a two-level partitioning of the matrices and an algorithm for the matrix multiplication with optimal data motion and constant storage. The algorithm has sequential arithmetic complexity \ $ 2P^ {3} , $ \ and parallel arithmetic complexity \ $ 2P^ {3} /N. $ \ The algorithm has been implemented on the Connection Machine model CM-2. For the performance on the 8K CM-2, we measured about 1.6 Gflops, which would scale up to about 13 Gflops for a 64K full machine. %X tr-19-91.ps.gz %R TR-20-91 %D 1991 %T On the Conversion between Binary Code and Binary-Reflected Gray Code on Boolean Cubes %A S. Lennart Johnsson %A Ching-Tien Ho %X We present a new algorithm for conversion between binary code and binary-reflected Gray code that requires approximately \ $ \frac {2K} {3} $ \ element transfers in sequence for \ $ K $ \ elements per node, compared to \ $ K $ \ element transfers for previously known algorithms.
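The binary-reflected Gray code underlying these conversions has a simple closed form, $g = b \oplus (b \gg 1)$. As a minimal illustrative sketch (local integer conversion only, not the report's Boolean-cube data-exchange algorithm):

```python
def binary_to_gray(b):
    # Binary-reflected Gray code: adjacent codewords differ in exactly one bit.
    return b ^ (b >> 1)

def gray_to_binary(g):
    # Invert by folding the shifted code back down with XOR.
    b = g
    while g:
        g >>= 1
        b ^= g
    return b

# Round-trip check over a small range.
assert all(gray_to_binary(binary_to_gray(i)) == i for i in range(1 << 10))
# Successive Gray codes differ in exactly one bit.
assert all(bin(binary_to_gray(i) ^ binary_to_gray(i + 1)).count("1") == 1
           for i in range(255))
```

The one-bit-difference property is what makes the code attractive for embedding rings and multi-dimensional grids in Boolean cubes, which is why conversion between the two codes arises as a communication problem at all.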
For a Boolean cube of \ $ n = 2 $ \ dimensions the new algorithm degenerates to yield a complexity of \ $ \frac {K} {2} + 1 $ \ element transfers, which is optimal. The new algorithm is optimal within a factor of \ $ \frac {1} {3} $ \ for any routing strategy. We show that the minimum number of element transfers for minimum path length routing is \ $ K $ \ with concurrent communication on all channels of every node of a Boolean cube. %X tr-20-91.ps.gz %R TR-21-91 %D 1991 %T All-to-All Broadcast and Applications on the Connection\\ Machine %A Jean-Philippe Brunet %A S. Lennart Johnsson %X An all-to-all broadcast routing algorithm that allows concurrent communication on all channels of the Connection Machine Boolean cube network is described. Explicit routing formulas are given for both the physical broadcast between processors, and the virtual broadcast within processors. Implementation issues are addressed and timings for the physical and virtual broadcast are given for the Connection Machine system CM-2. The peak data transfer rate for the physical broadcast on a 64k CM-2 is 4.1 Gbytes/sec, and the peak rate for the virtual broadcast is about 20 Gbytes/sec. Reshaping of arrays is shown experimentally to reduce the broadcast time by a factor of up to 7 by reducing the amount of local data motion. Finally, we also show how to exploit symmetry for computation of an interaction matrix using the all-to-all broadcast function. Further optimizations are suggested for \ $ N $-body type calculations. Using the all-to-all broadcast function, a peak rate of 5 Gflops/s has been achieved for the \ $ N $-body computations in 32-bit precision on a 64k CM-2. %X tr-21-91.ps.gz %R TR-22-91 %D 1991 %T New Approaches to Automating Network-Diagram Layout %A Corey Kosak %A Joe Marks %A Stuart Shieber %X Network diagrams are a familiar graphic form that can express many different kinds of information. The problem of automating network-diagram layout has therefore received much attention.
Previous research on network-diagram layout has focused on the problem of aesthetically optimal layout, using such criteria as the number of link crossings, the sum of all link lengths, and total diagram area. In this paper we propose a restatement of the network-diagram-layout problem in which layout-aesthetic concerns are subordinated to perceptual-organization concerns. We describe a notation for describing the visual organization of a network diagram. This notation is used in reformulating the layout task as a constrained-optimization problem in which constraints are derived from a visual-organization specification and optimality criteria are derived from layout-aesthetic considerations. Two new heuristic algorithms are presented for this version of the layout problem: one algorithm uses a rule-based strategy for computing a layout; the other is a massively parallel genetic algorithm. We demonstrate the capabilities of the two algorithms by testing them on a variety of network-diagram-layout problems. %R TR-23-91 %D 1991 %T Minimizing the Communication Time for Matrix Multiplication on Multi-Processors %A S. Lennart Johnsson %X We present a few algorithms that allow concurrency in communication on multiple channels of multi-processors to be exploited for the multiplication of matrices of arbitrary shapes. For multi-processors configured as \ $ n $-dimensional Boolean cubes our algorithms offer a speedup of the communication over previous algorithms for square matrices and square cubes by a factor of \ $ \frac {n} {2} . $ \ We show that configuring \ $ N $ \ processors as a three-dimensional array may reduce the communication complexity by a factor of \ $ \sqrt[6] {N} $ \ compared to the two-dimensional partitioning. The best two-dimensional configuration of the multi-processor nodes has a ratio between the number of rows and columns equal to the ratio between the number of rows and columns of the product matrix.
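The shape-matching rule for two-dimensional configurations can be illustrated with a small sketch (the helper below is invented for illustration, not taken from the report): enumerate the factorizations $r \times c = N$ and pick the processor grid whose aspect ratio is closest to that of the product matrix.

```python
def best_grid(N, rows, cols):
    """Choose an r x c processor grid (r * c == N) whose aspect ratio r/c
    is closest to the product-matrix aspect ratio rows/cols.
    Illustrative heuristic only; communication cost is not modeled."""
    target = rows / cols
    candidates = [(r, N // r) for r in range(1, N + 1) if N % r == 0]
    return min(candidates, key=lambda rc: abs(rc[0] / rc[1] - target))

# A square product matrix on 16 processors prefers a square 4 x 4 grid;
# a very tall product matrix pushes the grid toward a 16 x 1 column.
```

For example, `best_grid(16, 1000, 1000)` yields `(4, 4)`, matching the intuition that square product matrices favor square processor arrays.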
The optimum three-dimensional configuration has a ratio between the lengths of the machine axes equal to the ratio between the lengths of the three axes in matrix multiplication. For product matrices of extreme shape a one-dimensional partitioning may be optimum. All presented algorithms use standard communication functions. %X tr-23-91.ps.gz %R TR-24-91 %D 1991 %T Cooley-Tukey FFT on the Connection Machine %A S. Lennart Johnsson %A Robert L. Krawitz %X We describe an implementation of the Cooley-Tukey complex-to-complex FFT on the Connection Machine. The implementation is designed to make effective use of the communications bandwidth of the architecture, its memory bandwidth, and storage with precomputed twiddle factors. The peak data motion rate that is achieved for the interprocessor communication stages is in excess of 7 Gbytes/s for a Connection Machine system CM-200 with 2048 floating-point processors. The peak rate of FFT computations local to a processor is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision. The same FFT routine is used to perform both one- and multi-dimensional FFT without any explicit data rearrangement. The peak performance for one-dimensional FFT on data distributed over all processors is 5.4 Gflops/s in 32-bit precision and 3.2 Gflops/s in 64-bit precision. The peak performance for square, two-dimensional transforms, is 3.1 Gflops/s in 32-bit precision, and for cubic, three-dimensional transforms, the peak is 2.0 Gflops/s in 64-bit precision. Certain oblong shapes yield better performance. The number of twiddle factors stored in each processor is \ $ \frac {P} {2N} + \log_ {2} N $ \ for an FFT on \ $ P $ \ complex points uniformly distributed among \ $ N $ \ processors. To achieve this level of storage efficiency we show that a decimation-in-time FFT is required for normal order input, and a decimation-in-frequency FFT is required for bit-reversed input order.
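For reference, the serial radix-2 decimation-in-time Cooley-Tukey recurrence that such implementations parallelize can be sketched as follows (illustrative only; it models neither the data distribution nor the twiddle-factor storage scheme discussed above):

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # transform of the even-indexed samples
    odd = fft(x[1::2])    # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_n^k
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

# An impulse transforms to the all-ones spectrum.
assert all(abs(v - 1) < 1e-9 for v in fft([1, 0, 0, 0]))
```

Each level of the recursion corresponds to one butterfly stage; in the distributed setting, the stages whose butterflies span processor boundaries are the ones that generate the interprocessor communication measured above.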
%X tr-24-91.ps.gz %R TR-25-91 %D 1991 %T Communication Efficient Multi-processor FFT %A S. Lennart Johnsson %A Michel Jacquemin %A Robert L. Krawitz %X Computing the Fast Fourier Transform on a distributed memory architecture by a direct pipelined radix-2 algorithm, a bi-section or multi-section algorithm, all yield the same communications requirement, if communication for all FFT stages can be performed concurrently, the input data is in normal order, and the data allocation consecutive. With a cyclic data allocation, or bit-reversed input data and a consecutive allocation, multi-sectioning offers a reduced communications requirement by approximately a factor of two. For a consecutive data allocation, normal input order, a decimation-in-time FFT requires that \ $ \frac {P} {N} + d - 2 $ \ twiddle factors be stored for \ $ P $ \ elements distributed evenly over \ $ N $ \ processors and the axis subject to transformation distributed over \ $ 2^ {d} $ \ processors. No communication of twiddle factors is required. The same storage requirements hold for a decimation-in-frequency FFT, bit-reversed input order, and consecutive data allocation. The opposite combination of FFT type and data ordering requires a factor of \ $ \log_ {2} N $ \ more storage for \ $ N $ \ processors. The peak performance for a Connection Machine system CM-200 implementation is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision for unordered transforms local to each processor. The corresponding execution rates for ordered transforms are 11.1 Gflops/s and 8.5 Gflops/s, respectively. For distributed one- and two-dimensional transforms the peak performance for unordered transforms exceeds 5 Gflops/s in 32-bit precision, and 3 Gflops/s in 64-bit precision. Three-dimensional transforms execute at a slightly lower rate. Distributed ordered transforms execute at a rate of about \ $ \frac {1} {2} $ \ to \ $ \frac {2} {3} $ \ of the unordered transforms.
%X tr-25-91.ps.gz %R TR-26-91 %D 1991 %T Learning Nonoverlapping Perceptron Networks From Examples and Membership Queries %A Thomas R. Hancock %A Mostefa Golea %A Mario Marchand %X We investigate, within the PAC learning model, the problem of learning nonoverlapping perceptron networks. These are loop-free neural nets in which each node has only one outgoing weight. We give a polynomial time algorithm that PAC learns any nonoverlapping perceptron network using examples and membership queries. The algorithm is able to identify both the architecture and the weight values necessary to represent the function to be learned. Our results shed some light on the effect of the overlap on the complexity of learning in neural networks. %R TR-27-91 %D 1991 %T Labeling Point Features on Maps and Diagrams Using Simulated Annealing %A Jon Christensen %A Joe Marks %A Stuart Shieber %X A major factor affecting the clarity of graphical displays is the degree to which text labels obscure display features (including other labels) as a result of spatial overlap. Point-feature label placement (PFLP) is the problem of placing text labels adjacent to point features on a map or diagram in order to maximize legibility. This problem arises for all kinds of informational graphics, though it is most often associated with automated cartography. In this paper we present a comprehensive treatment of the PFLP problem. First, we summarize some recent results regarding the computational complexity of PFLP. These results show that optimal PFLP is NP-hard. Second, we survey previously reported algorithms for PFLP. Third, we describe a stochastic-optimization method for PFLP, based on simulated annealing. Finally, we present the results of an empirical comparison of the known algorithms for PFLP. Our results indicate that the simulated-annealing approach to PFLP is superior to all existing methods, regardless of label density.
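A toy version of the simulated-annealing approach to PFLP might look as follows (the candidate-slot model and the overlap test are invented for illustration; the cost function and move set used in the report are richer):

```python
import math
import random

def anneal(points, n_pos=4, steps=5000, t0=1.0, cooling=0.999):
    """Toy simulated annealing for point-feature label placement:
    each point gets one of n_pos candidate label slots, and the energy
    counts pairwise label clashes (here: same slot on nearby points)."""
    random.seed(0)

    def clashes(cfg):
        # Hypothetical overlap model: two labels clash when their points
        # are close (Manhattan distance < 2) and occupy the same slot.
        n = 0
        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                (xi, yi), (xj, yj) = points[i], points[j]
                if abs(xi - xj) + abs(yi - yj) < 2 and cfg[i] == cfg[j]:
                    n += 1
        return n

    cfg = [random.randrange(n_pos) for _ in points]
    energy, t = clashes(cfg), t0
    for _ in range(steps):
        i = random.randrange(len(points))
        old = cfg[i]
        cfg[i] = random.randrange(n_pos)
        e = clashes(cfg)
        # Metropolis rule: always accept improvements; accept uphill
        # moves with probability exp(-dE / t), which shrinks as t cools.
        if e > energy and random.random() >= math.exp((energy - e) / t):
            cfg[i] = old  # reject: restore the previous slot
        else:
            energy = e
        t *= cooling
    return cfg, energy
```

The annealing schedule (initial temperature and cooling rate) is the tunable part; accepting occasional uphill moves is what lets the search escape the local minima that greedy label-placement heuristics get stuck in.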
%R TR-28-91 %D 1991 %T Linearizable Read/Write Objects %A Marios Mavronicolas %A Dan Roth %X We study the cost of implementing {\em linearizable} read/write objects for shared-memory multiprocessors and under various assumptions on the available timing information. We take as cost measure the {\em worst-case response time} of performing an operation in distributed implementations of virtual shared memory consisting of such objects and supporting linearizability. It is assumed that processes have clocks that run at the same rate as real time and all messages incur a delay in the range $[d-u,d]$ for some known constants $u$ and $d$, $0 \leq u \leq d$. In the {\em perfect clocks} model, where processes have perfectly synchronized clocks and every message incurs a delay of exactly $d$, we present a family of optimal linearizable implementations, parameterized by a constant $\beta$, $0 \leq \beta \leq 1$, for which the worst-case response times for read and write operations are $\beta d$ and $(1-\beta)d$, respectively. The parameter $\beta$ may be appropriately chosen to account for the relative frequencies of read and write operations. Our main result is the first known linearizable implementation for the {\em imperfect clocks} model where clocks are not initially synchronized and message delays can vary, i.e., $u > 0$; it achieves worst-case response times of less than $4u+b$ ($b>0$ is an arbitrarily small constant) and $d+3u$ for read and write operations, respectively. This implementation uses novel synchronization techniques to exploit the lower bound on message delay time and achieve bounds on worst-case response times that depend on the message delay uncertainty $u$. For a wide range of values of $u$, these bounds improve previously known ones for implementations that support consistency conditions even weaker than linearizability. %R TR-01-92 %D 1992 %T Multiplication of Matrices of Arbitrary Shape on a Data\\ Parallel Computer %A Kapil K. Mathur %A S. 
Lennart Johnsson %X Some level-2 and level-3 Distributed Basic Linear Algebra Subroutines (DBLAS) that have been implemented on the Connection Machine system CM-200 are described. For matrix-matrix multiplication, both the nonsystolic and the systolic algorithms are outlined. A systolic algorithm that computes the product matrix in-place is described in detail. All algorithms that are presented here are part of the Connection Machine Scientific Software Library, CMSSL. We show that a level-3 DBLAS yields better performance than a level-2 DBLAS. On the Connection Machine system CM-200, blocking yields a performance improvement by a factor of up to three over level-2 DBLAS. For certain matrix shapes the systolic algorithms offer both improved performance and significantly reduced temporary storage requirements compared to the nonsystolic block algorithms. The performance improvement over the blocked nonsystolic algorithms may be as much as a factor of seven, or more than a factor of 20 over the level-2 DBLAS. We show that, in order to minimize the communication time, an algorithm that leaves the largest operand matrix stationary should be chosen for matrix-matrix multiplication. Furthermore, it is shown both analytically and experimentally that the optimum shape of the processor array yields square stationary submatrices in each processor, i.e., the ratio between the lengths of the axes of the processing array must be the same as the ratio between the corresponding axes of the stationary matrix. The optimum processor array shape may yield a factor of five performance enhancement for the multiplication of square matrices. For rectangular matrices a factor of 30 improvement was observed for an optimum processor array shape compared to a poorly chosen processor array shape. %X tr-01-92.ps.gz %R TR-02-92 %D 1992 %T A Data Parallel Finite Element Method for Computational Fluid Dynamics on the Connection Machine System %A Zden\u{e}k Johan %A Thomas J.R.
Hughes %A Kapil K. Mathur %A \\ %A S. Lennart Johnsson %X A finite element method for computational fluid dynamics has been implemented on the Connection Machine systems CM-2 and CM-200. An implicit iterative solution strategy, based on the preconditioned matrix-free GMRES algorithm, is employed. Parallel data structures built on both nodal and elemental sets are used to achieve maximum parallelization. Communication primitives provided through the Connection Machine Scientific Software Library substantially improved the overall performance of the program. Computations of three-dimensional compressible flows using unstructured meshes having close to one million elements, such as a complete airplane, demonstrate that the Connection Machine systems are suitable for these applications. Performance comparisons are also carried out with the vector computers Cray Y-MP and Convex C-1. %X tr-02-92.ps.gz %R TR-03-92 %D 1992 %T Efficiency of Semi-Synchronous versus Asynchronous Systems: Atomic Shared Memory %A Marios Mavronicolas %X The {\em $s$-session problem} is studied in {\em asynchronous} and {\em semi-synchronous} shared-memory systems, under a particular shared-memory communication primitive -- $k$-writer, $k$-reader atomic registers -- where $k$ is a constant reflecting the communication bound in the model. A session is a part of an execution in which each of $n$ processes takes at least one step; an algorithm for the $s$-session problem guarantees the existence of at least $s$ disjoint sessions. The existence of many sessions guarantees a degree of interleaving which is necessary for certain computations. In the asynchronous model, it is assumed that the time between any two consecutive steps of any process is in the interval $[0,1]$; in the semi-synchronous model, the time between any two consecutive steps of any process is in the interval $[c,1]$ for some $c$ such that $0 < c \leq 1$, the {\em synchronous} model being the special case where $c=1$.
All processes are initially synchronized and take a step at time $0$. Our main result is a tight (within constant factors) lower bound of $1 + \min \{\lfloor \frac {1} {2c} \rfloor, \lfloor \log_ {k} (n-1) - 1 \rfloor \} (s-2)$ for the time complexity of any semi-synchronous algorithm for the $s$-session problem. This result implies a time separation between semi-synchronous and asynchronous shared memory systems. %R TR-04-92 %D 1992 %T Block-Cyclic Dense Linear Algebra %A Woody Lichtenstein %A S. Lennart Johnsson %X Block-cyclic order elimination algorithms for LU and QR factorization and solve routines are described for distributed memory architectures with processing nodes configured as two-dimensional arrays of arbitrary shape. The cyclic order elimination together with a consecutive data allocation yields good load-balance for both the factorization and solution phases for the solution of dense systems of equations by LU and QR decomposition. Blocking may offer a substantial performance enhancement on architectures for which the level-2 or level-3 BLAS are ideal for operations local to a processing node. High rank updates local to a node may have a performance that is a factor of four or more higher than a rank-1 update. We show that in our implementation the \ $ O(N^ {2} )$ \ work in the factorization is of the same significance as the \ $ O(N^ {3} ) $ \ work, even for large matrices, because the \ $ O(N^ {2} ) $ \ work is poorly load-balanced in the two-dimensional processor array configuration. However, we show that the two-dimensional processor array configuration with consecutive data allocation and block-cyclic order elimination is optimal with respect to communication for a simple, but fairly general communications model. In our Connection Machine system CM-200 implementation, the peak performance for LU factorization is about 9.4 Gflops/s in 64-bit precision and 16 Gflops/s in 32-bit precision.
Blocking offers an overall performance enhancement of about a factor of two. For the data motion, use is made of the fact that the nodes along each axis of the two-dimensional array are interconnected as Boolean cubes. %X tr-04-92.ps.gz %R TR-05-92 %D 1992 %T Efficient, Strongly Consistent Implementations\\ of Shared Memory %A Marios Mavronicolas %A Dan Roth %X We present two distributed organizations of multiprocessor shared memory and develop for them implementations that are shown to satisfy a strong consistency condition, namely {\em linearizability}, achieve improvements in efficiency over previous ones that support even weaker consistency conditions and possess other important, sought-after properties that make them practically attractive. It is assumed throughout this paper that processes have clocks that run at the same rate as real time and all messages incur a delay in the range $[d-u,d]$ for some known constants $u$ and $d$, $0 \leq u \leq d$. The efficiency of an implementation is measured by the {\em worst-case response time} for performing an operation on an object. For the {\em full caching} organization, where each process keeps local copies of all objects, we present the first efficient linearizable implementation of read/write objects. The family of linearizable implementations we present is parameterized in a way that allows one to degrade the less frequently employed operation, and is shown to be essentially optimal. For the {\em single ownership} organization, each shared object is ``owned'' by a single process, which is most likely to access it frequently. We present an implementation that allows a process to access local information much faster (almost instantaneously) than it can access remote information, while still supporting linearizability. While the cost of the global operations depends on the maximal message delay $d$, the cost of the local operations depends only on the message delay uncertainty $u$.
In both implementations, decisions made by individual processes do not make use of any communicated timing information. In particular, timing information is not part of the messages passed by our protocols, and those are of bounded size. These two organizations can be combined in a hierarchical memory structure, which supports linearizability very efficiently; this hybrid structure allows processes to access local and remote information in a transparent manner, while at a lower level of the memory consistency system, different portions of the memory, allocated a priori according to anticipated remote versus local use of the objects, employ the suitable, full caching or single ownership implementation. %R TR-06-92 %D 1992 %T Lecture Notes on Domain Theory %A Robert Muller %X This report contains a collection of lecture notes for a series of lectures introducing Scott's {\em domain theory}. The basic structure of semantic domains and fixpoint theory are introduced. The lecture notes are not intended to serve as a primary reference but rather as a supplement to a more comprehensive treatment. %R TR-07-92 %D 1992 %T Index Transformation Algorithms in a Linear\\ Algebra Framework %A Alan Edelman %A Steve Heller %A S. Lennart Johnsson %X We present a linear algebraic formulation for a class of index transformations such as Gray code encoding and decoding, matrix transpose, bit reversal, vector reversal, shuffles, and other index or dimension permutations. This formulation unifies, simplifies, and can be used to derive algorithms for hypercube multiprocessors. We show how all the widely known properties of Gray codes, and some not so well-known properties as well, can be derived using this framework. Using this framework, we relate hypercube communications algorithms to Gauss-Jordan elimination on a matrix of 0's and 1's. %X tr-07-92.ps.gz %R TR-08-92 %D 1992 %T An Alternative Conception of Tree-Adjoining Derivation %A Yves Schabes %A Stuart M.
Shieber %X The precise formulation of derivation for tree-adjoining grammars has important ramifications for a wide variety of uses of the formalism, from syntactic analysis to semantic interpretation and statistical language modeling. We argue that the definition of tree-adjoining derivation must be reformulated in order to manifest the proper linguistic dependencies in derivations. The particular proposal is both precisely characterizable, through a compilation to linear indexed grammars, and computationally operational, by virtue of an efficient algorithm for recognition and parsing. %X tr-08-92.ps.gz %R TR-09-92 %D 1992 %T Local Basic Linear Algebra Subroutines (LBLAS) for\\ Distributed Memory Architectures and Languages with\\ Array Syntax %A S. Lennart Johnsson %A Luis F. Ortiz %X We describe a subset of the level-1, level-2, and level-3 BLAS implemented for each node of the Connection Machine system CM-200 and with a set of interfaces consistent with Fortran 90. The implementation performs computations on multiple instances in a single call to a routine. The strides for the different axes are derived from an array descriptor that contains information about the length of the axes, the number of instances and their allocation in the machine. Another novel feature of our implementation of the BLAS in each node is a selection of loop order for rank-1 updates and matrix-matrix multiplication based upon array shapes, strides, and DRAM page faults. The peak efficiencies for the routines are in the range 75\% to 90\%. The optimization of loop ordering has a success rate exceeding 99.8\% for matrices for which the sum of the lengths of the axes is at most 60. The success rate is even higher for all possible matrix shapes. The performance loss when a nonoptimal choice is made is less than $ \sim$15\% of peak, and typically less than 1\% of peak. We also show that the performance gain for high rank updates may be as much as a factor of 6 over rank-1 updates.
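The rank-1 versus high-rank distinction can be made concrete with a small pure-Python sketch (illustrative only; the factor-of-6 gain reported above comes from DRAM paging and register reuse, which this toy cannot show):

```python
def rank1_update(C, x, y):
    """C += x y^T, a single rank-1 update. Low arithmetic intensity:
    every element of C is read and written once per multiply-add."""
    for i in range(len(x)):
        for j in range(len(y)):
            C[i][j] += x[i] * y[j]

def rank_k_update(C, X, Y):
    """C += X Y^T for X: m x k and Y: n x k, as one blocked pass.
    Each element of C is read and written once for k multiply-adds,
    which is why high-rank updates suit cached or paged memory."""
    k = len(X[0])
    for i in range(len(X)):
        for j in range(len(Y)):
            C[i][j] += sum(X[i][t] * Y[j][t] for t in range(k))

# A rank-k update computes the same result as k successive rank-1
# updates; only the memory traffic per flop differs.
```

Numerically the two are interchangeable, so a library is free to batch rank-1 updates into blocked ones whenever the surrounding algorithm permits.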
%X tr-09-92.ps.gz %R TR-10-92 %D 1992 %T Direct Bulk-Synchronous Parallel Algorithms %A Alexandros V. Gerbessiotis %A Leslie G. Valiant %X We describe a methodology for constructing parallel algorithms that are transportable among parallel computers having different numbers of processors, different bandwidths of interprocessor communication and different periodicity of global synchronization. We do this for the bulk-synchronous parallel (BSP) model, which abstracts the characteristics of a parallel machine into three numerical parameters $p$, $g$, and $L$, corresponding to processors, bandwidth, and periodicity respectively. The model differentiates memory that is local to a processor from that which is not, but, for the sake of universality, does not differentiate network proximity. The advantages of this model in supporting shared memory or PRAM style programming have been treated elsewhere. Here we emphasize the viability of an alternative direct style of programming where, for the sake of efficiency the programmer retains control of memory allocation. We show that optimality to within a multiplicative factor close to one can be achieved for the problems of Gauss-Jordan elimination and sorting, by transportable algorithms that can be applied for a wide range of values of the parameters $p$, $g$, and $L$. We also give some simulation results for PRAMs on the BSP to identify the level of slack at which corresponding efficiencies can be approached by shared memory simulations, provided the bandwidth parameter $g$ is good enough. %X tr-10-92.ps.gz %R TR-11-92 %D 1992 %T Deriving Parallel and Systolic Programs from Data Dependence %A Lilei Chen %X We present an algorithm that statically sequences data computations and communications for parallel and systolic executions. Instead of searching for implicit parallelism in a functional or sequential program, the algorithm looks for sequence requirements imposed by the data dependence and by the communication delays. 
It achieves its efficiency by analyzing the sequence constraints. The actual sequence of the parallel computations can be decided at the last stage to tailor it to the specific parallel machine. In addition, once the processor mapping is decided, the data communication delay is combined with the computation sequence to obtain the final scheduling of the computations and communications. As a result, the parallel implementation fully exploits the parallelism in the original program and effectively schedules the computations and minimizes the communication cost by systolic design. Because the algorithm is based only on the data dependence from the original program, it can be applied to a wide variety of program forms, from sequential loop programs with updates, to recursive equation sets. It can detect parallelism in sequential programs, as well as provide efficient implementations for recurrence statements in equation sets. %R TR-12-92 %D 1992 %T Algebraic Optimization of Outerjoin Queries %A C\'{e}sar Alejandro Galindo-Legaria %X The purpose of this thesis is to extend database optimization techniques for joins to queries that contain both joins and outerjoins. The benefits of query optimization are thus extended to a number of important applications, such as federated databases, nested queries, and hierarchical views, for which outerjoin is a key component. Our analysis of join/outerjoin queries is done in two parts. First, we investigate the interaction of outerjoin with other relational operators, to find simplification rules and associativity identities. Our approach is comprehensive and includes, as special cases, some outerjoin optimization heuristics that have appeared in the literature. Second, we abstract the notion of feasible evaluation order for binary, join-like operators, considering associativity rules but not specific operator semantics.
Combining these two parts, we show that a join/outerjoin query can be evaluated by combining relations in any given order ---just as is done for join queries, except that now we need to synthesize an operator to use at each step, rather than always using join. The purpose of changing the order of processing of relations is to reduce the size of intermediate results. Our theoretical results are converted into algorithms compatible with the architecture of conventional database optimizers. For optimizers that repeatedly transform feasible strategies, the outerjoin identities we have identified can be applied directly. Those identities are sufficient to obtain all possible orders of processing. For optimizers that generate join programs bottom-up, we give a rule to determine the operators to use at each step. %X tr-12-92.ps.gz %R TR-13-92 %D 1992 %T Timing-Based, Distributed Computation:\\ Algorithms and Impossibility Results %A Marios Mavronicolas %X Real distributed systems are subject to timing uncertainties: processes may lack a common notion of real time, or may even have only inexact information about the amount of real time needed for performing primitive computation steps. In this thesis, we embark on a study of the complexity theory of such systems and present combinatorial results that determine the inherent costs of some accomplishable tasks. We first consider {\em continuous-time} models, where processes obtain timing information from continuous-time clocks that run at the same rate as real time, but might not be initially synchronized. Due to an uncertainty in message delay time, absolute process synchronization is known to be impossible for such systems. We develop novel synchronization schemes for such systems and use them for building a distributed, {\em full caching} implementation of shared memory that supports {\em linearizability}.
This implementation improves in efficiency over previous ones that support consistency conditions even weaker than linearizability, and supports a quantitative degradation of the less frequently occurring operation. We present lower bound results which show that our implementation achieves efficiency close to optimal. We next turn to {\em discrete-time} models, where the time between any two consecutive steps of a process is in the interval $[c,1]$, for some constant $c$ such that $0 \leq c \leq 1$. We show time separation results between {\em asynchronous} and {\em semi-synchronous} models, defined by taking $c=0$ and $c > 0$, respectively. Specifically, we use the {\em session problem} to show that the semi-synchronous model, for which the timing uncertainty, $\frac {1} {c} $, is bounded, is strictly more powerful than the asynchronous one under either message-passing or shared-memory interprocess communication. We also present tight lower and upper bounds on the degree of {\em precision} that can be achieved in the semi-synchronous model. Our combinatorial results shed some light on the capabilities and limitations of distributed systems subject to timing uncertainties. In particular, the main argument of this thesis is that the goal of designing distributed algorithms so that their logical correctness is timing-independent, whereas their performance might depend on timing assumptions, will not always be achievable: for some tasks, the only practical solutions might be strongly timing-dependent. %R TR-14-92 %D 1992 %T An Upper and a Lower Bound for Tick Synchronization %A Marios Mavronicolas %X The {\em tick synchronization problem} is defined and studied in the {\em semi-synchronous} network, where $n$ processes are located at the nodes of a complete graph and communicate by sending messages along links that correspond to its edges.
An algorithm for the tick synchronization problem brings each process into a synchronized state in which the process makes an estimate of real time that is close enough to those of other processes already in a synchronized state. It is assumed that the (real) time for message delivery is at most $d$ and the time between any two consecutive steps of any process is in the interval $[c,1]$ for some $c$ such that $0 < c \leq 1$. We define the {\em precision} of a tick synchronization algorithm to be the maximum difference between estimates of real time made by different processes in a synchronized state, and propose it as a worst-case performance measure. We show that no such algorithm can guarantee precision less than $\lfloor \frac {d-2} {2c} \rfloor$. We also present an algorithm which achieves a precision of $\frac {2(n-1)} {n} (\lceil \frac {2d} {c} \rceil + \frac {d} {2} ) + \frac {1-c} {c} d+1$. %R TR-15-92 %D 1992 %T The Complexity of Learning Formulas and Decision Trees that have Restricted Reads %A Thomas Raysor Hancock %X Many learning problems can be phrased in terms of finding a close approximation to some unknown target formula $f$, based on observing $f$'s value on a sample of points either drawn at random according to some underlying distribution, or perhaps selected by a learner for algorithmic reasons. In this research our goal is to prove theorems about what classes of formulas permit such learning in polynomial time (using the definitions of either Valiant's PAC model or Angluin's exact identification model). In particular we take powerful classes of formulas whose learnability is unknown or provably intractable, and then consider restricted cases where the number of different times a single variable may appear in the formula is limited to a small constant. 
We prove positive learnability results in several such cases, given either added assumptions on the underlying distribution of random points or the ability of the learner to select some of the sample points. We provide polynomial time learning algorithms for decision trees and monotone disjunctive normal form (DNF) formulas when variables appear at most some arbitrary constant number of times, given that the sample points are chosen uniformly. Over arbitrary distributions, we show algorithms that choose their own sample points, besides using random examples, to closely approximate the same class of decision trees and the class of DNF formulas where variables appear at most twice. For arbitrary formulas, we give a number of algorithms for the read-once case (where variables appear only once) over different bases (the functions computed at the formula's nodes). Besides identification algorithms for large classes of boolean read-once formulas, these results include new interpolation algorithms for classes of rational functions, and a membership query algorithm for a new class of neural networks. %R TR-16-92 %D 1992 %T Optimal Communication Channel Utilization for \\ Matrix Transposition and Related Permutations\\ on Binary Cubes %A S. Lennart Johnsson %A Ching-Tien Ho %X We present optimal schedules for permutations in which each node sends one or several unique messages to every other node. With concurrent communication on all channels of every node in binary cube networks, the number of element transfers in sequence for \ $ K $ \ elements per node is \ $ {K \over 2} , $ \ irrespective of the number of nodes over which the data set is distributed. For a succession of \ $ s $ \ permutations within disjoint subcubes of \ $ d $ \ dimensions each, our schedules yield \ $ \min ( {K \over 2} + (s - 1)d,(s + 3)d, {K \over 2} + 2 d) $ \ exchanges in sequence.
The algorithms can be organized to avoid indirect addressing in the internode data exchanges, a property that increases the performance on some architectures. For message passing communication libraries, we present a blocking procedure that minimizes the number of block transfers while preserving the utilization of the communication channels. For schedules with optimal channel utilization, the number of block transfers for a binary \ $ d $-cube is \ $ d. $ The maximum block size for \ $ K $ \ elements per node is \ $ \lceil {K \over 2d} \rceil. $ %X tr-16-92.ps.gz %R TR-17-92 %D 1992 %T Parallel Sets: An Object-Oriented Methodology for Massively Parallel Programming %A Michael Francis Kilian %X Parallel programming has become the focus of much research in the past decade. As the limits of VLSI technology are tested, it becomes more apparent that parallel processors will be responsible for the next quantum leap in performance. Already parallel programming is responsible for significant advances not so much in the speed of solving problems, but in the size of problems that can be solved. Carefully crafted parallel programs are solving problems orders of magnitude larger than could be considered for serial machines. Object-oriented programming has also become popular in academia and perhaps even more so in industry. O-O holds out the promise of being able to efficiently build large systems that are understandable, maintainable, and more robust. The programs targeted by O-O are different from those typically found running on a computer such as the Connection Machine. Parallel programs are often designed for very specific tasks; O-O programs' strengths are that they handle a wide variety of requirements. The thesis proposed here is that an object-oriented model of programming can be developed that is suitable for massively parallel processors. A set of criteria is developed for object-oriented parallel programming models, and existing models are evaluated using these criteria.
Given these criteria, the thesis presents a new way of thinking of parallel programs that builds upon an object-oriented foundation. A new basic type, called Parallel-Set, is added to the object model. Parallel sets are rigorously defined and then used to express complex communication between objects. The communication model is then extended to allow communication and synchronization protocols to be developed. The contribution of this work is that a wider range of reliable programs can be designed for use on parallel computers and that these programs will be easier to construct and understand. %R TR-18-92 %D 1992 %T Language and Compiler Issues in Scalable High\\ Performance Scientific Libraries %A S. Lennart Johnsson %X Library functions for scalable architectures must be designed to correctly and efficiently support any distributed data structure that can be created with the supported languages and associated compiler directives. Libraries must also be designed to support concurrency in each function evaluation, as well as the concurrent application of the functions to disjoint array segments, known as {\em multiple-instance} computation. Control over the data distribution is often critical for locality of reference, and so is the control over the interprocessor data motion. Scalability, while preserving efficiency, implies that the data distribution, the data motion, and the scheduling are adapted to the object shapes, the machine configuration, and the size of the objects relative to the machine size. The Connection Machine Scientific Software Library is a scalable library for distributed data structures. The library is designed for languages with an array syntax. It is accessible from all supported languages ($\ast$Lisp, C$\ast$, CM-Fortran, and Paris (PARallel Instruction Set) in combination with Lisp, C, and Fortran 77).
Single library calls can manage both concurrent application of a function to disjoint array segments and concurrency in each application of a function. The control of the concurrency is independent of the control constructs provided in the high-level languages. Library functions operate efficiently on any distributed data structure that can be defined in the high-level languages and associated directives. Routines may use their own internal data distribution for efficiency reasons. The algorithm invoked by a call to a library function depends upon the shapes of the objects involved, their sizes and distribution, and upon the machine shape and size. %X tr-18-92.ps.gz %R TR-19-92 %D 1992 %T Lessons from a Restricted Turing Test %A Stuart M. Shieber %X We report on the recent Loebner prize competition inspired by Turing's test of intelligent behavior. The presentation covers the structure of the competition and the outcome of its first instantiation in an actual event, and an analysis of the purpose, design, and appropriateness of such a competition. We argue that the competition has no clear purpose, that its design prevents any useful outcome, and that such a competition is inappropriate given the current level of technology. We then speculate as to suitable alternatives to the Loebner prize. %X tr-19-92.ps.gz %R TR-20-92 %D 1992 %T An Efficient Algorithm for Gray-to-Binary Permutation\\ on Hypercubes %A Ching-Tien Ho %A M.T. Raghunath %A S. Lennart Johnsson %X Both Gray code and binary code are frequently used in mapping arrays into hypercube architectures. While the former is preferred when communication between adjacent array elements is needed, the latter is preferred for FFT-type communication. When different phases of computations have different types of communication patterns, the need arises to remap the data.
We give a nearly optimal algorithm for permuting data from a Gray code mapping to a binary code mapping on a hypercube with communication restricted to one input and one output channel per node at a time. Our algorithm improves over the best previously known algorithm [6] by nearly a factor of two and is optimal to within a factor of \ $ n/(n-1) $ \ with respect to data transfer time on an \ $n$-cube. The expected speedup is confirmed by measurements on an Intel iPSC/2 hypercube. %X tr-20-92.ps.gz %R TR-21-92 %D 1992 %T Physically Realistic Trajectory Planning in Animation:\\ A Stimulus-Response Approach %A J. Thomas Ngo %A Joe Marks %X Trajectory-planning problems arise in animation when figures subject to physical law and other constraints on their motion must be made to move realistically. Witkin and Kass dubbed this class of problems ``Spacetime Constraints'' (SC) and presented results for specific problems involving an articulated figure. SC problems are typically multimodal and discontinuous, and the number of decision alternatives available at each time step can make constructing even coarse trajectories for subsequent optimization difficult without directive input from the user. Rather than use a time-domain representation of the trajectory, which might be appropriate for local optimization, our algorithm uses a stimulus-response model. Locomotive skills are acquired by a procedure which chooses stimulus-response parameters using a parallel genetic algorithm, and the algorithm succeeds in finding good novel solutions for a test suite of SC problems involving unbranched articulated figures. %R TR-22-92 %D 1992 %T Pronouns, Names, and the Centering of Attention in Discourse %A Peter C. Gordon %A Barbara J. Grosz %A Laura A. Gilliom %X Centering theory, developed within computational linguistics, provides an account of ways in which patterns of inter-utterance reference can promote the local coherence of discourse.
It states that each utterance in a coherent discourse segment contains a single semantic entity -- the backward-looking center -- that provides a link to the previous utterance, and an ordered set of entities -- the forward-looking centers -- that offer potential links to the next utterance. We report five reading-time experiments that test predictions of this theory with respect to the conditions under which it is preferable to realize (refer to) an entity using a pronoun rather than a repeated definite description or name. The experiments show that there is a single backward-looking center that is preferentially realized as a pronoun, and that the backward-looking center is typically realized as the grammatical subject of the utterance. They also provide evidence that there is a set of forward-looking centers that is ranked in terms of prominence and that a key factor in determining prominence, surface-initial position, does not affect determination of the backward-looking center. This provides evidence for the dissociation of the coherence processes of looking backward and looking forward. %R TR-23-92 %D 1992 %T Communication Primitives for Unstructured Finite Element Simulations on Data Parallel Architectures %A Kapil K. Mathur %A S. Lennart Johnsson %X Efficient data motion is critical for high performance computing on distributed memory architectures. The value of some techniques for efficient data motion is illustrated by identifying generic communication primitives. Further, the efficiency of these primitives is demonstrated on three different applications using the finite element method for unstructured grids and sparse solvers with different communication requirements. For the applications presented, the techniques advocated reduced the communication times by a factor of 1.5 to 3. %X tr-23-92.ps.gz %R TR-24-92 %D 1992 %T A Combining Mechanism for Parallel Computers %A Leslie G.
Valiant %X In a multiprocessor computer, communication among the components may be based either on a simple router, which delivers messages point-to-point like a mail service, or on a more elaborate combining network that, in return for a greater investment in hardware, can combine messages to the same address prior to delivery. This paper describes a mechanism for recirculating messages in a simple router so that the added functionality of a combining network, for arbitrary access patterns, can be achieved by it with reasonable efficiency. The method brings together the messages with the same destination address in more than one stage, and at a set of components that is determined by a hash function and decreases in number at each stage. %X tr-24-92.ps.gz %R TR-25-92 %D 1992 %T Labeling Point Features on Maps and Diagrams %A Jon Christensen %A Joe Marks %A Stuart Shieber %X (Revised 6/94; includes color figures.) A major factor affecting the clarity of graphical displays that include text labels is the degree to which labels obscure display features (including other labels) as a result of spatial overlap. Point-feature label placement (PFLP) is the problem of placing text labels adjacent to point features on a map or diagram so as to maximize legibility. This problem occurs frequently in the production of many types of informational graphics, though it arises most often in automated cartography. In this paper we present a comprehensive treatment of the PFLP problem, viewed as a type of combinatorial optimization problem. Complexity analysis reveals that the basic PFLP problem and most interesting variants of it are NP-hard. These negative results help inform a survey of previously reported algorithms for PFLP; not surprisingly, all such algorithms either have exponential time complexity or are incomplete. To solve the PFLP problem in practice, then, we must rely on good heuristic methods.
We propose two new methods, one based on a discrete form of gradient descent, the other on simulated annealing, and report on a series of empirical tests comparing these and the other known algorithms for the problem. Based on this study, the first of its kind, we identify the best approaches as a function of available computation time. %X tr-25-92.ps.gz %R TR-26-92 %D 1992 %T Why BSP Computers? %A L.G. Valiant %R TR-27-92 %D 1992 %T Variations on Incremental Interpretation %A Stuart M. Shieber %A Mark Johnson %X tr-27-92.ps.gz %R TR-28-92 %D 1992 %T A Fixpoint Theory of Nonmonotonic Functions and \\ Its Applications to Logic Programs, Deductive Databases\\ and %A Yuli Zhou %X In this thesis we shall employ denotational (fixpoint) methods to study the computations of rule systems based on first order logic. The resulting theory parallels and further strengthens the fixpoint theory of {\it stratified logic programs} developed by Apt, Blair and Walker, and we shall consider two principal applications of the theory to logic programs and to production rule systems. %R TR-29-92 %D 1992 %T Massively Parallel Computing: Data distribution and\\ communication %A S. Lennart Johnsson %X We discuss some techniques for preserving locality of reference in index spaces when mapped to memory units in a distributed memory architecture. In particular, we discuss the use of multidimensional address spaces instead of linearized address spaces, partitioning of irregular grids, and placement of partitions among nodes. We also discuss a set of communication primitives we have found very useful on the Connection Machine systems in implementing scientific and engineering applications. We briefly review some of the techniques used to fully utilize the bandwidth of the binary cube network of the CM-2 and CM-200, and give some performance data from implementations of communication primitives.
%X tr-29-92.ps.gz %R TR-30-92 %D 1992 %T Optimal Computing on Mesh-Connected Processor Arrays %A Christos Ioannis Kaklamanis %X In this thesis, we present and analyze new algorithms for routing, sorting and dynamic searching on mesh-connected arrays of processors; we also present a lower bound concerning embeddings on faulty arrays. In particular, we first consider the problem of permutation routing in two- and three-dimensional mesh-connected processor arrays. We present new on-line and off-line routing algorithms, all of which are optimal to within a small additive term. Then, we show that sorting an input of size $N = n^2$ can be performed by an $n \times n$ mesh-connected processor array in $2n + o(n)$ parallel communication steps and using constant-size queues, with high probability. This result is optimal to within a low order additive term, realizing the obvious diameter lower bound. Our techniques can be applied to higher-dimensional meshes as well as torus-connected networks, achieving significantly better bounds than the known results. Furthermore, we investigate the parallel complexity of the backtrack and branch-and-bound search on the mesh-connected array. We present an $\Omega(\sqrt {dN} /\sqrt {\log N} )$ lower bound for the time needed by a {\em randomized} algorithm to perform backtrack and branch-and-bound search of a tree of depth $d$ on the $\sqrt {N} \times \sqrt {N} $ mesh, even when the depth of the tree is known in advance. For the upper bounds we give {\em deterministic} algorithms that are within a factor of $O(\log^ {{3} \over {2}} N)$ from our lower bound. Our algorithms do not make any assumption on the shape of the tree to be searched. Our algorithm for branch-and-bound is the first algorithm that performs branch-and-bound search on a sparse network. Both the lower and the upper bounds extend to higher-dimensional meshes.
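As a toy companion to the mesh sorting results above (our illustration, not the thesis's $2n + o(n)$ algorithm), odd-even transposition sort is the classic one-dimensional analogue: $n$ values held by a linear array of $n$ cells are sorted in $n$ rounds of simultaneous neighbour compare-exchanges.

```python
def odd_even_transposition_sort(values):
    """Model of sorting on a linear processor array: in round r, the
    disjoint pairs starting at even (r even) or odd (r odd) positions
    compare-exchange in parallel; n rounds always suffice."""
    a = list(values)
    n = len(a)
    for r in range(n):
        for i in range(r % 2, n - 1, 2):  # one parallel round
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

On a two-dimensional mesh the same compare-exchange idea generalizes, but matching the diameter bound to within $o(n)$ requires the randomized techniques and constant-size queues described in the abstract.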
%R TR-31-92 %D 1992 %T An Algebraic Approach to the Compilation and Operational\\ Semantics of Functional Languages with I-structures %A Zena Matilde Ariola %X Modern languages are too complex to be given direct operational semantics. For example, the operational semantics of functional languages has traditionally been given by translating them to the $\lambda$-calculus extended with constants. Compilers do a similar translation into an intermediate form in the process of generating code for a machine. A compiler then performs optimizations on this intermediate form before generating machine code. In this thesis we show that the intermediate form can actually be the kernel language. In fact, we may translate the kernel language into still lower-level language(s), where more machine-oriented or efficiency-related concerns can be expressed directly. Furthermore, compiler optimizations may be expressed as source-to-source transformations on the intermediate languages. We introduce two implicitly parallel languages, Kid (Kernel Id) and P-TAC (Parallel Three Address Code), and describe the compilation process of Id in terms of a translation of Id into Kid, and of Kid into P-TAC\@. In this thesis we do not describe the compilation process below the P-TAC level. However, we show that our compilation process allows the formalization of questions related to the correctness of the optimizations. We also give the operational semantics of Id indirectly by its translation into Kid and a well-defined operational semantics for Kid. Kid and P-TAC are examples of Graph Rewriting Systems (GRSs), which are introduced to capture sharing of computation precisely. Sharing of subexpressions is important both semantically (e.g., to model side-effects) and pragmatically (e.g., to reason about complexity). Our GRSs extend Barendregt's Term Graph Rewriting Systems to include cyclic graphs and cyclic rules.
We present a term model for GRSs along the lines of L\'evy's term model for $\lambda$-calculus, and show its application to compiler optimizations. We also show that GRS reduction is a correct implementation of term rewriting. %R TR-01-93 %D 1993 %T Massively Parallel Computing: \\ Mathematics and communications libraries %A S. Lennart Johnsson %A Kapil K. Mathur %X Massively parallel computing holds the promise of extreme performance. The utility of these systems will depend heavily upon the availability of libraries until compilation and run-time system technology is developed to a level comparable to what is common today on most uniprocessor systems. Critical for performance is the ability to exploit locality of reference and effective management of the communication resources. We discuss some techniques for preserving locality of reference in distributed memory architectures. In particular, we discuss the benefits of multidimensional address spaces instead of the conventional linearized address spaces, partitioning of irregular grids, and placement of partitions among nodes. Some of these techniques are supported as language directives, others as run-time system functions, and others still are part of the Connection Machine Scientific Software Library, CMSSL. We briefly discuss some of the unique design issues in this library for distributed memory architectures, some of the novel ideas for managing data allocation, and the automatic, performance-based selection of algorithms. The CMSSL also includes a set of communication primitives we have found very useful on the Connection Machine systems in implementing scientific and engineering applications. We briefly review some of the techniques used to fully utilize the bandwidth of the binary cube network of the CM-2 and CM-200 Connection Machine systems. %X tr-01-93.ps.gz %R TR-02-93 %D 1993 %T All-to-All Communication on the Connection Machine CM-200 %A Kapil K. Mathur %A S.
Lennart Johnsson %X Detailed algorithms for all-to-all broadcast and reduction are given for arrays mapped by binary or binary-reflected Gray code encoding to the processing nodes of binary cube networks. Algorithms are also given for the local computation of the array indices for the communicated data, thereby reducing the demand for communications bandwidth. For the Connection Machine system CM-200, Hamiltonian cycle based all-to-all communication algorithms yield a performance that is a factor of two to ten higher than the performance offered by algorithms based on trees, butterfly networks, or the Connection Machine router. The peak data rate achieved for all-to-all broadcast on a 2048 node Connection Machine system CM-200 is 5.4 Gbytes/sec when no reordering is required. If the time for data reordering is included, then the effective peak data rate is reduced to 2.5 Gbytes/sec. %X tr-02-93.ps.gz %R TR-03-93 %D 1993 %T Topics in Parallel and Distributed Computation %A Alexandros Gerbessiotis %X With advances in communication technology, the introduction of multiple-instruction multiple-data parallel computers and the increasing interest in neural networks, the fields of parallel and distributed computation have received increasing attention in recent years. We study in this work the bulk-synchronous parallel model, which attempts to bridge the software and hardware worlds with respect to parallel computing. It offers a high level abstraction of the hardware with the purpose of allowing parallel programs to run efficiently on diverse hardware platforms. We examine direct algorithms on this model and also give simulations of other models of parallel computation on this one as well as on models that bypass it. While the term parallel computation refers to the execution of a single or a set of closely coupled tasks by a set of processors, the term distributed computation refers to more loosely coupled or uncoupled tasks being executed at different locations.
In a distributed computing environment it is sometimes necessary for one computer to send various pieces of information to the remaining ones. The term broadcasting is used to describe the dissemination of information from one computer to the others in such an environment. We examine various classes of random graphs with respect to broadcasting and establish results related to the minimum time required to perform broadcasting from any vertex of such graphs. Various models of the human brain, as a collection of distributed elements working in parallel, have been proposed. Such elements are connected together in a network. The network in the human brain is sparse. How such sparse networks of simple elements can perform any useful computation is a topic that is currently little understood. We examine in this work a graph construction problem as it relates to neuron allocation in a recently proposed model of neural networks. We also examine a certain class of random graphs with respect to this problem and establish various results related to the distribution of the sizes of sets of neurons when various learning tasks are performed on this model. Experimental results are also presented and compared to theoretically derived ones. %R TR-04-93 %D 1993 %T Infrastructure for Research towards Ubiquitous Information\\ Systems %A Barbara Grosz %A H.T. Kung %A Margo Seltzer %A Stuart Shieber\\ %A Michael Smith %R TR-05-93 %D 1993 %T An Equational Framework for the Flow Analysis of Higher Order Functional Programs %A Dan Stefanescu %A Yuli Zhou %X This paper presents a novel technique for the static analysis of functional programs. The method uses the Cousots' original framework, expanded by a syntax-based abstraction methodology. The main idea is to represent each computational entity in a functional program in relation to its (concrete) call string, i.e., the string of function calls leading to its computation.
Furthermore, the abstraction criterion consists of choosing an equivalence relation over the set of all call strings. Based on this equivalence relation, the method generates a monotonic system of equations such that its least solution is the desired result of the analysis. This approach generalizes previous techniques (0CFA, 1CFA, etc.) in flow analysis and allows for program-directed design of frameworks for approximate analysis of programs. The method is proven correct with respect to a rewriting-system-based operational semantics. %R TR-06-93 %D 1993 %T An Efficient Communication Strategy for Finite Element\\ Methods on the Connection Machine CM-5 System %A Zden\v{e}k Johan %A Kapil K. Mathur %A S. Lennart Johnsson %A Thomas J.R. Hughes %X Performance of finite element solvers on parallel computers such as the Connection Machine CM-5 system is directly related to the efficiency of the communication strategy. The objective of this work is twofold. First, we propose a data-parallel implementation of a partitioning algorithm used to decompose unstructured meshes. The mesh partitions are then mapped to the vector units of the CM-5. Second, we design gather and scatter operations taking advantage of data locality coming from the decomposition to reduce the communication time. This new communication strategy is available in the CMSSL [8]. An example illustrates the performance of the proposed strategy. %X tr-06-93.ps.gz %R TR-07-93 %D 1993 %T All-to-All Communication Algorithms for Distributed BLAS %A Kapil K. Mathur %A S. Lennart Johnsson %X Dense Distributed Basic Linear Algebra Subroutine (DBLAS) algorithms based on all-to-all broadcast and all-to-all reduce are presented. For DBLAS, at each all-to-all step, it is necessary to know the data values and the indices of the data values as well.
This is in contrast to the more traditional applications of all-to-all broadcast (such as an N-body solver) where the identity of the data values is not of much interest. Detailed schedules for all-to-all broadcast and reduction are given for the data motion of arrays mapped to the processing nodes of binary cube networks using binary encoding and binary-reflected Gray encoding. The algorithms compute the indices for the communicated data locally. No communication bandwidth is consumed for data array indices. For the Connection Machine system CM-200, Hamiltonian cycle based all-to-all communication algorithms improve performance by a factor of two to ten over a combination of tree, butterfly network, and router-based algorithms. The data rate achieved for all-to-all broadcast on a 256 node Connection Machine system CM-200 is 0.3 Gbytes/sec. The data motion rate for all-to-all broadcast, including the time for index computations and local data reordering, is about 2.8 Gbytes/sec for a 2048 node system. Excluding the time for index computation and local memory reordering, the measured data motion rate for all-to-all broadcast is 5.6 Gbytes/s. On a Connection Machine system, CM-200, with 2048 processing nodes, the overall performance of the distributed matrix vector multiply (DGEMV) and vector matrix multiply (DGEMV with TRANS) is 10.5 Gflops/s and 13.7 Gflops/s, respectively. %X tr-07-93.ps.gz %R TR-08-93 %D 1993 %T Massively Parallel Computing: \\ Unstructured Finite Element Simulations %A Kapil K. Mathur %A Zden\v{e}k Johan %A S. Lennart Johnsson %A Thomas J.R. Hughes %X Massively parallel computing holds the promise of extreme performance. Critical for achieving high performance is the ability to exploit locality of reference and effective management of the communication resources.
This article describes two communication primitives and associated mapping strategies that have been used for several different unstructured, three-dimensional, finite element applications in computational fluid dynamics and structural mechanics. %X tr-08-93.ps.gz %R TR-09-93 %D 1993 %T The Connection Machine Systems CM-5 %A S. Lennart Johnsson %X The Connection Machine system CM-5 is a parallel computing system scalable to Tflop performance, hundreds of Gbytes of primary storage, Tbytes of secondary storage and Gbytes/s of I/O bandwidth. The system has been designed to be scalable over a range of up to three orders of magnitude. We will discuss the design goals, innovative software and hardware features of the CM-5 system, and some experience with the system. %X tr-09-93.ps.gz %R TR-10-93 %D 1993 %T Constraint-Driven Diagram Layout %A Ed Dengler %A Mark Friedell %A Joe Marks %X Taking both perceptual organization and aesthetic criteria into account is the key to high-quality diagram layout, but makes for a more difficult problem than pure aesthetic layout. Computing the layout of a network diagram that exhibits a specified perceptual organization can be phrased as a constraint-satisfaction problem. Some constraints are derived from the perceptual-organization specification: the nodes in the diagram must be positioned so that they form specified perceptual gestalts, i.e., certain groups of nodes must form perceptual groupings by proximity, or symmetry, or shape motif, etc. Additional constraints are derived from aesthetic considerations: the layout should satisfy criteria that concern the number of link crossings, the sum of link lengths, or diagram area, etc. Using a generalization of a simple mass-spring layout technique to ``satisfice'' constraints, we show how to produce high-quality layouts with specified perceptual organization for medium-sized diagrams (10--30 nodes) in under 30 seconds on a workstation.
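The mass-spring layout technique mentioned in the TR-10-93 abstract can be illustrated with a minimal force-directed sketch. The spring law, repulsion term, and all constants below are illustrative assumptions, not the authors' actual energy function or constraint-satisfaction machinery:

```python
import math, random

def layout(nodes, edges, ideal=1.0, k_rep=0.5, steps=500, lr=0.02):
    """Toy force-directed ('mass-spring') layout sketch.

    nodes: list of node ids; edges: list of (u, v) pairs.
    Springs pull linked nodes toward an ideal edge length, and a
    pairwise repulsion term keeps all nodes apart; iterating the
    forces drives the layout toward a low-energy configuration.
    """
    random.seed(0)
    pos = {n: [random.random(), random.random()] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Spring forces along edges (Hooke-like, normalized by length).
        for u, v in edges:
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            d = math.hypot(dx, dy) or 1e-9
            f = (d - ideal) / d
            force[u][0] += f * dx; force[u][1] += f * dy
            force[v][0] -= f * dx; force[v][1] -= f * dy
        # Inverse-square repulsion between every pair of nodes.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[v][0] - pos[u][0]
                dy = pos[v][1] - pos[u][1]
                d2 = dx * dx + dy * dy or 1e-9
                f = k_rep / d2
                force[u][0] -= f * dx; force[u][1] -= f * dy
                force[v][0] += f * dx; force[v][1] += f * dy
        # Gradient-descent-style position update.
        for n in nodes:
            pos[n][0] += lr * force[n][0]
            pos[n][1] += lr * force[n][1]
    return pos
```

Perceptual-organization constraints of the kind the abstract describes (proximity groupings, symmetry) would enter as additional force terms in the same loop.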
%X tr-10-93.ps.gz %R TR-11-93 %D 1993 %T An Efficient Communication Strategy for Finite Element\\ Methods on the Connection Machine CM-5 System %A Zden\v{e}k Johan %A Kapil K. Mathur %A S. Lennart Johnsson %A Thomas J.R. Hughes %X The objective of this paper is to propose communication procedures suitable for unstructured finite element solvers implemented on distributed-memory parallel computers such as the Connection Machine CM-5 system. First, a data-parallel implementation of the recursive spectral bisection (RSB) algorithm proposed by Pothen {\em et al.} is presented. The RSB algorithm is associated with a node renumbering scheme which improves data locality of reference. Two-step gather and scatter operations taking advantage of this data locality are then designed. These communication primitives make use of the indirect addressing capability of the CM-5 vector units to achieve high gather and scatter bandwidths. The efficiency of the proposed communication strategy is illustrated on large-scale three-dimensional fluid dynamics problems. %X tr-11-93.ps.gz %R TR-12-93 %D 1993 %T Aligning Sentences in Bilingual Corpora Using Lexical\\ Information %A Stanley F. Chen %X In this paper, we describe a fast algorithm for aligning sentences with their translations in a bilingual corpus. Existing efficient algorithms ignore word identities and only consider sentence length (Brown {\it et al.}, 1991; Gale and Church, 1991). Our algorithm constructs a simple statistical word-to-word translation model on the fly during alignment. We find the alignment that maximizes the probability of generating the corpus with this translation model. We have achieved an error rate of approximately 0.4\% on Canadian Hansard data, which is a significant improvement over previous results. The algorithm is language independent.
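The length-based dynamic program that the TR-12-93 abstract contrasts with can be sketched as follows. The bead penalties and the length-mismatch cost are illustrative stand-ins, not Gale and Church's probabilistic model or Chen's lexical one:

```python
def align(src_lens, tgt_lens):
    """Toy length-based sentence alignment by dynamic programming.

    src_lens/tgt_lens: sentence lengths of the two corpus halves.
    Returns a list of beads (n_src, n_tgt) covering both sides,
    chosen to minimize a total cost combining a per-bead penalty
    with a simple length-mismatch term.
    """
    INF = float("inf")

    def cost(s, t):  # penalize differing total lengths (stand-in model)
        return abs(s - t) / (s + t + 1)

    # Allowed bead types and illustrative fixed penalties.
    beads = {(1, 1): 0.0, (1, 0): 2.0, (0, 1): 2.0, (2, 1): 0.5, (1, 2): 0.5}
    n, m = len(src_lens), len(tgt_lens)
    best = [[(INF, None)] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0.0, None)
    for i in range(n + 1):
        for j in range(m + 1):
            base = best[i][j][0]
            if base == INF:
                continue
            for (ds, dt), pen in beads.items():
                if i + ds <= n and j + dt <= m:
                    s = sum(src_lens[i:i + ds])
                    t = sum(tgt_lens[j:j + dt])
                    c = base + pen + cost(s, t)
                    if c < best[i + ds][j + dt][0]:
                        best[i + ds][j + dt] = (c, (i, j, ds, dt))
    # Trace the chosen bead sequence back from the end.
    out, i, j = [], n, m
    while best[i][j][1] is not None:
        pi, pj, ds, dt = best[i][j][1]
        out.append((ds, dt))
        i, j = pi, pj
    return out[::-1]
```

Chen's contribution is to replace the length-based cost above with one derived from a word-to-word translation model estimated during alignment.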
%R TR-13-93 %D 1993 %T Compaction and Separation Algorithms for Non-Convex Polygons and Their Applications %A Zhenyu Li %A Victor Milenkovic %X Given a two-dimensional, non-overlapping layout of convex and non-convex polygons, compaction can be thought of as simulating the motion of the polygons as a result of applied ``forces.'' We apply compaction to improve the material utilization of an already tightly packed layout. Compaction can be modeled as a motion of the polygons that reduces the value of some functional on their positions. Optimal compaction, planning a motion that reaches a layout that has the global minimum functional value among all reachable layouts, is shown to be NP-complete under certain assumptions. We first present a compaction algorithm based on existing physical simulation approaches. This algorithm uses a new velocity-based optimization model. Our experimental results reveal the limitation of physical simulation: even though our new model improves the running time of our algorithm over previous simulation algorithms, the algorithm still cannot compact typical layouts of one hundred or more polygons in a reasonable amount of time. The essential difficulty of physically based models is that they can only generate velocities for the polygons, and the final positions must be generated by numerical integration. We present a new position-based optimization model that allows us to use linear programming to calculate directly new polygon positions that are at a local minimum of the objective. The new model yields a translational compaction algorithm that runs two orders of magnitude faster than physical simulation methods. We also consider the problem of separating overlapping polygons using a minimal amount of motion and show it to be NP-complete. Although this separation problem looks quite different from the compaction problem, our new model also yields an efficient algorithm to solve it.
The compaction/separation algorithms have been applied to marker making: the task of packing polygonal pieces on a sheet of cloth of fixed width so that total length is minimized. The compaction algorithm has improved cloth utilization of human-generated pants markers. The separation algorithm together with a database of human-generated markers can be used for automatic generation of markers that approach human performance. %X tr-13-93.ps.gz %R TR-14-93 %D 1993 %T Universal Boolean Judges and Their Characterization %A Eyal Kushilevitz %A Silvio Micali %A Rafail Ostrovsky %X We consider the classic problem of $n$ honest (but curious) players with private inputs $x_1,\ldots, x_n$ who wish to compute the value of some pre-determined function $f(x_1,\ldots,x_n)$, so that at the end of the protocol every player knows the value of $f(x_1,\ldots,x_n)$. The players have unbounded computational resources and they wish to compute $f$ in a totally {\em private\/} ($n$-private) way. That is, after the completion of the protocol, which all players honestly follow, no coalition (of arbitrary size) can infer any information about the private inputs of the remaining players beyond what has already been revealed by the value of $f(x_1,\ldots,x_n)$. Of course, with the help of a {\em trusted judge for computing $f$}, players can trivially compute $f$ in a totally private manner: every player secretly gives his input to the trusted judge and she announces the result. Previous research was directed towards implementing such a judge ``mentally'' by the players themselves, and was shown possible under various assumptions. Without assumptions, however, it was shown that most functions {\em cannot\/} be computed in a totally private manner and thus we must rely on a trusted judge. If we have a trusted judge for $f$ we are done. Can we use a judge for a ``simpler'' function $g$ in order to compute $f$ $n$-privately?
In this paper we initiate the study of the {\em complexity} of such judges needed to achieve total privacy for arbitrary $f$. We answer the following two questions: {\em How complicated must such a judge be, compared to $f$?} and {\em Does there exist some judge which can be used for all $f$?} We show that there exist {\bf universal boolean} judges (i.e., ones that can be used for any $f$) and give a complete characterization of all the boolean functions which describe universal judges. In fact, we show that a judge computing {\em any\/} boolean function $g$ which itself cannot be computed $n$-privately (i.e., when there is no judge available) is {\em universal\/}. Thus, we show that for all boolean functions, the notions of {\bf universality\/} and {\bf $n$-privacy} are {\em complementary\/}. On the other hand, for non-boolean functions, we show that these two notions are {\em not\/} complementary. Our result can be viewed as a strong generalization of the two-party case, where Oblivious Transfer protocols were shown to be universal. %R TR-15-93 %D 1993 %T An Unbundled Compiler %A Thomas Cheatham %R TR-16-93 %D 1993 %T Actions, Beliefs and Intentions in Multi-Action Utterances %A Cecile Tiberghien Balkanski %X Multi-action utterances convey critical information about agents' beliefs and intentions with respect to the actions they talk about or perform. Two such utterances may, for example, describe the same actions while the speakers of these utterances hold beliefs about these actions that are diametrically opposed. Hence, for a language interpretation system to understand multi-action utterances, it must be able (1) to determine the actions that are described and the ways in which they are related, and (2) to draw appropriate inferences about the agents' mental states with respect to these actions and action relations.
This thesis investigates the semantics of two particular multi-action constructions: utterances with means clauses and utterances with rationale clauses. These classes of utterances are of interest not only as exemplars of multi-action utterances, but also because of the subtle differences in information that can be felicitously inferred from their use. Their meaning is shown to depend on the beliefs and intentions of the speaker and agents whose actions are being described as well as on the actions themselves. Thus, the thesis demonstrates (a) that consideration of mental states cannot be reserved to pragmatics and (b) that other aspects of natural language interpretation besides the interpretation of mental state verbs or plan recognition may provide information about mental states. To account for this aspect of natural language interpretation, this thesis presents a theory of logical form, a theory of action and action relations, an axiomatization of belief and intention, and interpretation rules for means clauses and rationale clauses. Together these different pieces constitute an interpretation model that meets the requirements specified in (1) and (2) above and that predicts the set of beliefs and intentions shown to be characteristic of utterances with means clauses and rationale clauses. This model has been implemented in the MAUI system (Multi-Action Utterance Interpreter), which accepts natural language sentences from a user, computes their logical form, and answers questions about the beliefs and intentions of the speaker and actor regarding the actions and action relations described. %R TR-17-93 %D 1993 %T Stochastic Approximation Algorithms for Number Partitioning %A Wheeler Ruml %X This report summarizes research on algorithms for finding particularly good solutions to instances of the NP-complete number-partitioning problem. Our approach is based on stochastic search algorithms, which iteratively improve randomly chosen initial solutions.
Instead of searching the space of all $2^{n-1}$ possible partitionings, however, we use these algorithms to manipulate indirect encodings of candidate solutions. An encoded solution is evaluated by a decoder, which interprets the encoding as instructions for constructing a partitioning of a given problem instance. We present several different solution encodings, including bit strings, permutations, and rule sets, and describe decoding algorithms for them. Our empirical results show that many of these encodings restrict and reshape the solution space in ways that allow relatively generic search methods, such as hill climbing, simulated annealing, and the genetic algorithm, to find solutions that are often as good as those produced by the best known constructive heuristic, and in many cases far superior. For the algorithms and representations we consider, the choice of solution representation plays an even greater role in determining performance than the choice of search algorithm. %X tr-17-93.ps.gz %R TR-18-93 %D 1993 %T POLYSHIFT Communications Software for the Connection Machine System CM-200 %A William George %A Ralph G. Brickner %A S. Lennart Johnsson %X We describe the use and implementation of a polyshift function {\bf PSHIFT} for circular shifts and end-off shifts. Polyshift is useful in many scientific codes using regular grids, such as finite difference codes in several dimensions, multigrid codes, molecular dynamics computations, and in lattice gauge physics computations, such as quantum chromodynamics (QCD) calculations. Our implementation of the {\bf PSHIFT} function on the Connection Machine systems CM-2 and CM-200 offers a speedup of up to a factor of 3-4 compared to {\bf CSHIFT} when the local data motion within a node is small. The {\bf PSHIFT} routine is included in the Connection Machine Scientific Software Library (CMSSL). %X tr-18-93.ps.gz %R TR-19-93 %D 1993 %T High Performance, Scalable Scientific Software Libraries %A S.
Lennart Johnsson %A Kapil K. Mathur %X Massively parallel processors introduce new demands on software systems with respect to performance, scalability, robustness and portability. The increased complexity of the memory systems and the increased range of problem sizes for which a given piece of software is used pose serious challenges to software developers. The Connection Machine Scientific Software Library, CMSSL, uses several novel techniques to meet these challenges. The CMSSL contains routines for managing the data distribution and provides data distribution independent functionality. High performance is achieved through careful scheduling of operations and data motion, and through the automatic selection of algorithms at run-time. We discuss some of the techniques used, and provide evidence that CMSSL has reached the goals of performance and scalability for an important set of applications. %X tr-19-93.ps.gz %R TR-20-93 %D 1993 %T A Collaborative Planning Approach to Discourse Understanding %A Karen Lochbaum %X Approaches to discourse understanding fall roughly into two categories: those that treat mental states as elemental and thus reason directly about them, and those that do not reason about the beliefs and intentions of agents themselves, but about the propositions and actions that might be considered objects of those beliefs and intentions. The first type of approach is a mental phenomenon approach, the second a data-structure approach (Pollack, 1986b). In this paper, we present a mental phenomenon approach to discourse understanding and demonstrate its advantages over the data-structure approaches used by other researchers. The model we present is based on the collaborative planning framework of SharedPlans (Grosz and Sidner, 1990; Lochbaum, Grosz, and Sidner, 1990; Grosz and Kraus, 1993).
SharedPlans are shown to provide a computationally realizable model of the intentional component of Grosz and Sidner's theory of discourse structure (Grosz and Sidner, 1986). Additionally, this model is shown to simplify and extend approaches to discourse understanding that introduce multiple types of plans to model an agent's motivations for producing an utterance (Litman, 1985; Litman and Allen, 1987; Ramshaw, 1991; Lambert and Carberry, 1991). %X tr-20-93.ps.gz %R TR-21-93 %D 1993 %T Evolving Line Drawings %A Ellie Baker %A Margo Seltzer %X This paper explores the application of interactive genetic algorithms to the creation of line drawings. We have built a system that starts with a collection of drawings that are either randomly generated or input by the user. The user selects one such drawing to mutate or two to mate, and a new generation of drawings is produced by randomly modifying or combining the selected drawing(s). This process of selection and procreation is repeated many times to evolve a drawing. A wide variety of complex sketches with highlighting and shading can be evolved from very simple drawings. This technique has enormous potential for augmenting and enhancing the power of traditional computer-aided drawing tools, and for expanding the repertoire of the computer-assisted artist. %X tr-21-93.ps.gz %R TR-22-93 %D 1993 %T A Stencil Compiler for the Connection Machine Models\\ CM-2/200 %A Ralph G. Brickner %A William George %A S. Lennart Johnsson %A Alan Ruttenberg %X In this paper we present a Stencil Compiler for the Connection Machine Models CM-2 and CM-200. A {\em stencil} is a weighted sum of circularly-shifted CM Fortran arrays. The stencil compiler optimizes the data motion between processing nodes, minimizes the data motion within a node, and minimizes the data motion between registers and local memory in a node. The compiler makes novel use of the communication system and has highly optimized register use.
The compiler natively supports two-dimensional stencils, but stencils in three or four dimensions are automatically decomposed. Portions of the system are integrated as part of the CM Fortran programming system, and also as part of the system microcode. The compiler is available as part of the Connection Machine Scientific Software Library (CMSSL) Release 3.1. %X tr-22-93.ps.gz %R TR-23-93 %D 1993 %T CMSSL: A Scalable Scientific Software Library %A S. Lennart Johnsson %X Massively parallel processors introduce new demands on software systems with respect to performance, scalability, robustness and portability. The increased complexity of the memory systems and the increased range of problem sizes for which a given piece of software is used pose serious challenges for software developers. The Connection Machine Scientific Software Library, CMSSL, uses several novel techniques to meet these challenges. The CMSSL contains routines for managing the data distribution and provides data distribution independent functionality. High performance is achieved through careful scheduling of operations and data motion, and through the automatic selection of algorithms at run-time. We discuss some of the techniques used, and provide evidence that CMSSL has reached the goals of performance and scalability for an important set of applications. %X tr-23-93.ps.gz %R TR-24-93 %D 1993 %T Annotating Floor Plans Using Deformable Polygons %A Kathy Ryall %A Joe Marks %A Murray Mazer %A Stuart Shieber %X The ability to recognize regions in a bitmap image has applications in various areas, from document recognition of scanned building floor plans to processing of scanned forms. We consider the use of deformable polygons for delineating partially or fully bounded regions of a scanned bitmap that depicts a building floor plan. We discuss a semi-automated interactive system, in which a user positions a seed polygon in an area of interest in the image.
The computer then expands and deforms the polygon in an attempt to minimize an energy function that is defined so that configurations with minimum energy tend to match the subjective boundaries of regions in the image. When the deformation process is completed, the user may edit the deformed polygon to make it conform more closely to the desired region. In contrast to area-filling techniques for delineating areal regions of images, our approach works robustly for partially bounded regions. %X tr-24-93.ps.gz %R TR-01-94 %D 1994 %T Reasoning with Models %A Roni Khardon %A Dan Roth %X We develop a model-based approach to reasoning, in which the knowledge base is represented as a set of models (satisfying assignments) rather than a logical formula, and the set of queries is restricted. We show that for every propositional knowledge base (KB) there exists a set of {\em characteristic models} with the property that a query is true in KB if and only if it is satisfied by the models in this set. We fully characterize a set of theories for which the model-based representation is compact and provides efficient reasoning. These include cases where the formula-based representation does not support efficient reasoning. In addition, we consider the model-based approach to {\em abductive reasoning} and show that for any propositional KB, reasoning with its model-based representation yields an abductive explanation in time that is polynomial in its size. Some of our technical results make use of the {\em Monotone Theory}, a recently introduced characterization of Boolean functions. \par The notion of {\em restricted queries} is inherent to our approach. This is a wide class of queries for which reasoning is very efficient and exact, even when the model-based representation KB provides only an approximate representation of the ``world''.
\par Moreover, we show that the theory developed here generalizes the model-based approach to reasoning with Horn theories, and even captures the notion of reasoning with Horn-approximations. Our result characterizes the Horn theories for which the approach suggested there is useful, and the phenomena observed there regarding the relative sizes of the formula-based and model-based representations of KB are explained and put in a wider context. %X tr-01-94.ps.gz %R TR-02-94 %D 1994 %T Learning to Reason %A Roni Khardon %A Dan Roth %X We introduce a new framework for the study of reasoning. The Learning (in order) to Reason approach developed here combines the interfaces to the world used by known learning models with the reasoning task and a performance criterion suitable for it. In this framework the intelligent agent is given access to her favorite learning interface, and is also given a grace period in which she can interact with this interface and construct her representation KB of the world $W$. Her reasoning performance is measured only after this period, when she is presented with queries $\alpha$ from some query language, relevant to the world, and has to answer whether $W$ implies $\alpha$. \par The approach is meant to overcome the main computational difficulties in the traditional treatment of reasoning which stem from its separation from the ``world''. First, by allowing the reasoning task to interface the world (as in the known learning models), we avoid the rigid syntactic restriction on the intermediate knowledge representation. Second, we make explicit the dependence of the reasoning performance on the input from the environment. This is possible only because the agent interacts with the world when constructing her knowledge representation.
\par We show how previous results from learning theory and reasoning fit into this framework and illustrate the usefulness of the Learning to Reason approach by exhibiting new results that are not possible in the traditional setting. First, we give a Learning to Reason algorithm for a class of propositional languages for which there are no efficient reasoning algorithms, when represented as a traditional (formula-based) knowledge base. Second, we exhibit a Learning to Reason algorithm for a class of propositional languages that is not known to be learnable in the traditional sense. %X tr-02-94.ps.gz %R TR-03-94 %D 1994 %T Mesh Decomposition and Communication Procedures for Finite Element Applications on The Connection Machine CM-5 System %A Z. Johan %A K.K. Mathur %A S.L. Johnsson %A T.J.R. Hughes %X TR-08-94 SUPERSEDES TR-03-94. %R TR-04-94 %D 1994 %T Data Parallel Finite Element Techniques for Compressible Flow Problems %A Z. Johan %A K.K. Mathur %A S.L. Johnsson %A T.J.R. Hughes %X We present a brief description of a finite element solver implemented on the Connection Machine CM-5 system. A more detailed presentation of the issues involved in such an implementation can be found in [1,2]. %X tr-04-94.ps.gz %R TR-05-94 %D 1994 %T Motion-Synthesis Techniques for 2D Articulated Figures %A Alex Fukunaga %A Lloyd Hsu %A Peter Reiss %A Andrew Shuman %A Jon Christensen %A Joe Marks %A J. Thomas Ngo %X In this paper we extend previous work on automatic motion synthesis for physically realistic 2D articulated figures in three ways. First, we describe an improved motion-synthesis algorithm that runs substantially faster than previously reported algorithms. Second, we present two new techniques for influencing the style of the motions generated by the algorithm. These techniques can be used by an animator to achieve a desired movement style, or they can be used to guarantee variety in the motions synthesized over several runs of the algorithm.
Finally, we describe an animation editor that supports the interactive concatenation of existing, automatically generated motion controllers to produce complex, composite trajectories. Taken together, these results suggest how a usable, useful system for articulated-figure motion synthesis might be developed. %X tr-05-94.ps.gz %R TR-06-94 %D 1994 %T Motion Synthesis for 3D Articulated Figures and Mass-Spring Models %A Hadi Partovi %A Jon Christensen %A Amir Khosrowshahi %A Joe Marks %A J. Thomas Ngo %X Motion synthesis is the process of automatically generating visually plausible motions that meet goal criteria specified by a human animator. The objects whose motions are synthesized are often animated characters that are modeled as articulated figures or mass-spring lattices. Controller synthesis is a technique for motion synthesis that involves searching in a space of possible controllers to generate appropriate motions. Recently, automatic controller-synthesis techniques for 2D articulated figures have been reported. An open question is whether these techniques can be generalized to work for 3D animated characters. In this paper we report successful automatic controller synthesis for 3D articulated figures and mass-spring models that are subject to nonholonomic constraints. These results show that the 3D motion-synthesis problem can be solved in some challenging cases, though much work on this general topic remains to be done. %X tr-06-94.ps.gz %R TR-07-94 %D 1994 %T Parallel implementation of recursive spectral bisection on the Connection Machine CM-5 system %A Zden\v{e}k Johan %A Kapil K. Mathur %A S. Lennart Johnsson %A Thomas J.R. Hughes %X The recursive spectral bisection (RSB) algorithm was proposed by Pothen {\em et al.} [1] as the basis for computing small vertex separators for sparse matrices. Simon [2] applied this algorithm to mesh decomposition and showed that spectral bisection compared favorably with other decomposition techniques.
Since then, the RSB algorithm has been widely accepted in the scientific community because of its robustness and its consistency in the high-quality partitionings it generates. The major drawback of the RSB algorithm is its high computing cost, as noted in [2], caused by the need for solving a series of eigenvalue problems. It is often stated that an unstructured mesh can be decomposed after it is generated, and the decomposition reused for the different calculations performed on that mesh. However, a new partitioning must be obtained if adaptive mesh refinement is required. The mesh also has to be re-decomposed if the number of processing nodes available to the user changes between two calculations. To keep the mesh decomposition from becoming a significant computational bottleneck, an efficient data-parallel implementation of the RSB algorithm using the CM Fortran language [3] is developed. In this paper, we present only an abbreviated description of the parallel implementation of the RSB algorithm, followed by two decomposition examples. Details of the implementation can be found in [4]. %X tr-07-94.ps.gz %R TR-08-94 %D 1994 %T Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System %A Zden\v{e}k Johan %A Kapil K. Mathur %A S. Lennart Johnsson %A Thomas J.R. Hughes %X The objective of this paper is to analyze the impact of data mapping strategies on the performance of finite element applications. First, we describe a parallel mesh decomposition algorithm based on recursive spectral bisection used to partition the mesh into element blocks. A simple heuristic algorithm then renumbers the mesh nodes. Large three-dimensional meshes are used to demonstrate the efficiency of these mapping strategies and to assess the performance of a finite element program for fluid dynamics. %X tr-08-94.ps.gz %R TR-09-94 %D 1994 %T Data Motion and High Performance Computing %A S.
Lennart Johnsson %X Efficient data motion has been of critical importance in high performance computing almost since the first electronic computers were built. Providing sufficient memory bandwidth to balance the capacity of processors led to memory hierarchies and to banked, interleaved memories. With the rapid evolution of MOS technologies, microprocessor and memory designs, it is realistic to build systems with thousands of processors and a sustained performance of a trillion operations per second or more. Such systems require tens of thousands of memory banks, even when locality of reference is exploited. Using conventional technologies, interconnecting several thousand processors with tens of thousands of memory banks is feasible only through some form of sparse interconnection network. Efficient use of locality of reference and network bandwidth is critical. We review these issues in this paper. %X tr-09-94.ps.gz %R TR-10-94 %D 1994 %T Easily Searched Encodings for Number Partitioning %A Wheeler Ruml %A J. Thomas Ngo %A Joe Marks %A Stuart M. Shieber %X Can stochastic search algorithms outperform existing deterministic heuristics for the NP-hard problem Number Partitioning if given a sufficient, but practically realizable amount of time? In a thorough empirical investigation using a straightforward implementation of one such algorithm, simulated annealing, Johnson et al. (1991) concluded tentatively that the answer is "no." In this paper we show that the answer can be "yes" if attention is devoted to the issue of problem representation (encoding). We present results from empirical tests of several encodings of Number Partitioning with problem instances consisting of multiple-precision integers drawn from a uniform probability distribution.
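The deterministic baseline for Number Partitioning, the Karmarkar-Karp set-differencing heuristic, repeatedly commits the two largest remaining numbers to opposite sides of the partition by replacing them with their difference; the final value is the achieved partition difference. A minimal sketch (the example instance is illustrative):

```python
import heapq

def karmarkar_karp(nums):
    # Largest-differencing heuristic for Number Partitioning: repeatedly
    # replace the two largest values by their difference, committing them
    # to opposite sides. Max-heap simulated via negation.
    heap = [-x for x in nums]
    heapq.heapify(heap)
    while len(heap) > 1:
        a = -heapq.heappop(heap)
        b = -heapq.heappop(heap)
        heapq.heappush(heap, -(a - b))
    return -heap[0]

# The heuristic leaves a residue of 2 here, although a perfect split
# exists (8+7 = 6+5+4 = 15) -- exactly the kind of gap that search over a
# good encoding can close.
print(karmarkar_karp([8, 7, 6, 5, 4]))
```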
With these instances and with an appropriate choice of representation, stochastic and deterministic searches can -- routinely and in a practical amount of time -- find solutions several orders of magnitude better than those constructed by the best heuristic known (Karmarkar and Karp, 1982), which does not employ searching. The choice of encoding is found to be more important than the choice of search technique in determining search efficacy. Three alternative explanations for the relative performance of the encodings are tested experimentally. The best encodings tested are found to contain a high proportion of good solutions; moreover, in those encodings, the solutions are organized into a single "bumpy funnel" centered at a known position in the search space. This is likely to be the only relevant structure in the search space because a blind search performs as well as any other search technique tested when the search space is restricted to the funnel tip. We also show how analogous representations might be designed in a principled manner for other difficult combinatorial optimization problems by applying the principles of parameterized arbitration, parameterized constraint, and parameterized greediness. \par Keywords: number partitioning, NP-complete, representation, encoding, empirical comparison, stochastic optimization, parameterized arbitration, parameterized constraint, parameterized greediness. %X tr-10-94r.ps.gz %R TR-11-94 %D 1994 %T Principles and Implementation of Deductive Parsing %A Stuart M. Shieber %A Yves Schabes %A Fernando C. N. Pereira %X We present a system for generating parsers based directly on the metaphor of parsing as deduction. Parsing algorithms can be represented directly as deduction systems, and a single deduction engine can interpret such deduction systems so as to implement the corresponding parser. 
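The parsing-as-deduction metaphor can be sketched with CKY-style items $(A, i, j)$, read as "category A spans words i through j", closed under a binary completion rule by a generic agenda-driven engine. The toy CNF grammar and sentence below are illustrative, not drawn from the paper:

```python
# Sketch of parsing as deduction: a generic agenda-driven engine exhausts
# the consequences of lexical axioms under the CKY completion rule
# (A, i, j) from (B, i, k), (C, k, j) and rule A -> B C.

def deduce(words, lexical, binary):
    # lexical: {word: {categories}}; binary: {(B, C): {A}} for rules A -> B C
    chart, agenda = set(), []
    for i, w in enumerate(words):               # axioms from the lexicon
        for cat in lexical.get(w, ()):
            agenda.append((cat, i, i + 1))
    while agenda:                               # exhaust the consequences
        item = agenda.pop()
        if item in chart:
            continue
        chart.add(item)
        b, i, k = item
        for (c, k2, j) in list(chart):          # item as left constituent
            if k2 == k:
                for a in binary.get((b, c), ()):
                    agenda.append((a, i, j))
        for (c, j, i2) in list(chart):          # item as right constituent
            if i2 == i:
                for a in binary.get((c, b), ()):
                    agenda.append((a, j, k))
    return chart

lex = {'she': {'NP'}, 'eats': {'V'}, 'fish': {'NP'}}
rules = {('NP', 'VP'): {'S'}, ('V', 'NP'): {'VP'}}
chart = deduce(['she', 'eats', 'fish'], lex, rules)
```

The sentence is recognized exactly when the goal item ('S', 0, 3) appears in the chart; swapping in different inference rules changes the parsing algorithm without touching the engine.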
The method generalizes easily to parsers for augmented phrase structure formalisms, such as definite-clause grammars and other logic grammar formalisms, and has been used for rapid prototyping of parsing algorithms for a variety of formalisms including variants of tree-adjoining grammars, categorial grammars, and lexicalized context-free grammars. %X tr-11-94.ps.gz %R TR-12-94 %D 1994 %T Multiple Containment Methods %A Karen Daniels %A Zhenyu Li %A Victor Milenkovic %X We present three different methods for finding solutions to the 2D translation-only {\em containment} problem: find translations for $k$ polygons that place them inside a given polygonal container without overlap. Both the container and the polygons to be placed in it may be non-convex. First, we provide several exact algorithms that improve results for $k=2$ or $k=3$. In particular, we give an algorithm for three convex polygons and a non-convex container with running time in ${\rm O}(m^3n\log mn)$, where $n$ is the number of vertices in the container, and $m$ is the total number of vertices of the $k$ polygons. This is an improvement of a factor of $n^2$ over previous algorithms. Second, we give an approximation algorithm for $k$ non-convex polygons and a non-convex container based on restriction and subdivision of the configuration space. Third, we develop a MIP (mixed integer programming) model for $k$ non-convex polygons and a non-convex container. %X tr-12-94.ps.gz %R TR-13-94 %D 1994 %T A Recursive Coalescing Method for Bisecting Graphs %A Bryan Mazlish %A Stuart Shieber %A Joe Marks %X We present an extension to a hybrid graph-bisection algorithm developed by Bui et al. that uses vertex coalescing and the Kernighan-Lin variable-depth algorithm to minimize the size of the cut set. In the original heuristic technique, one iteration of vertex coalescing is used to improve the performance of the original Kernighan-Lin algorithm.
We show that by performing vertex coalescing recursively, substantially greater improvements can be achieved for standard random graphs of average degree in the range [2.0,5.0]. %X tr-13-94.ps.gz %R TR-14-94 %D 1994 %T PRISC: Programmable Reduced Instruction Set Computers %A Rahul Razdan %X This thesis introduces Programmable Reduced Instruction Set Computers (PRISC) as a new class of general-purpose computers. PRISC use RISC techniques as a base, but in addition to the conventional RISC instruction resources, PRISC offer hardware programmable resources which can be configured based on the needs of a particular application. This thesis presents the architecture, operating system, and programming language compilation techniques which are needed to successfully build PRISC. Performance results are provided for the simplest form of PRISC -- a RISC microprocessor with a set of programmable functional units consisting of only combinational functions. Results for the SPECint92 benchmark suite indicate that an augmented compiler can provide a performance improvement of 22\% over the underlying RISC computer with a hardware area investment less than that needed for a 2 kilobyte SRAM. In addition, active manipulation of the source code leads to significantly higher local performance gains (250\%-500\%) for general abstract data types such as short-set vectors, hash tables, and finite state machines. Results on end-user applications that utilize these data types indicate performance gains of 32\%-213\%. %X tr-14-94.tar.Z %R TR-15-94 %D 1994 %T Compaction Algorithms for Non-Convex Polygons and Their Applications %A Zhenyu Li %X Given a two-dimensional, non-overlapping layout of convex and non-convex polygons, {\em compaction} refers to a simultaneous motion of the polygons that generates a more densely packed layout. In industrial two-dimensional packing applications, compaction can improve the material utilization of already tightly packed layouts.
Efficient algorithms for compacting a layout of non-convex polygons were not previously known. \par This dissertation offers the first systematic study of compaction of non-convex polygons. We start by formalizing the compaction problem as that of planning a motion that minimizes some linear objective function of the positions. Based on this formalization, we study the complexity of compaction and show it to be PSPACE-hard. \par The major contribution of this dissertation is a position-based optimization model that allows us to calculate directly new polygon positions that constitute a locally optimum solution of the objective via linear programming. This model yields the first practically efficient algorithm for translational compaction--compaction in which the polygons can only translate. This compaction algorithm runs in almost real time and improves the material utilization of production quality human-generated layouts from the apparel industry. \par Several algorithms are derived directly from the position-based optimization model to solve related problems arising from manual or automatic layout generation. In particular, the model yields an algorithm for separating overlapping polygons using a minimal amount of motion. This separation algorithm together with a database of human-generated markers can automatically generate markers that approach human performance. \par Additionally, we provide several extensions to the position-based optimization model. These extensions enable the model to handle small rotations, to offer flexible control of the distances between polygons, and to find optimal solutions to the two-dimensional packing of non-convex polygons. \par This dissertation also includes a compaction algorithm based on existing physical simulation approaches.
Although our experimental results showed that it is not practical for compacting tightly packed layouts, this algorithm is of interest because it shows that the simulation can be sped up significantly by replacing physical constraints with geometric constraints. It also reveals the inherent limitations of physical simulation algorithms in compacting tightly packed layouts. \par Most of the algorithms presented in this dissertation have been implemented on a SUN ${\rm SparcStation}^{\rm TM}$ and have been included in a software package licensed to a CAD company. %X tr-15-94.ps.gz %R TR-16-94 %D 1994 %T Scalability of Finite Element Applications on Distributed-Memory Parallel Computers %A Zden\u{e}k Johan %A Kapil K. Mathur %A S. Lennart Johnsson %A Thomas J.R. Hughes %X This paper demonstrates that scalability and competitive efficiency can be achieved for unstructured grid finite element applications on distributed memory machines, such as the Connection Machine CM-5 system. The efficiency of finite element solvers is analyzed through two applications: an implicit computational aerodynamics application and an explicit solid mechanics application. Scalability of mesh decomposition and data mapping strategies is also discussed. Numerical examples that support the claims for problems with more than fourteen million variables are presented. %X tr-16-94.ps.gz %R TR-17-94 %D 1994 %T Improved Noise-Tolerant Learning and Generalized Statistical Queries %A Javed A. Aslam %A Scott E. Decatur %X The statistical query learning model can be viewed as a tool for creating (or demonstrating the existence of) noise-tolerant learning algorithms in the PAC model. The complexity of a statistical query algorithm, in conjunction with the complexity of simulating SQ algorithms in the PAC model with noise, determines the complexity of the noise-tolerant PAC algorithms produced.
Although roughly optimal upper bounds have been shown for the complexity of statistical query learning, the corresponding noise-tolerant PAC algorithms are not optimal due to inefficient simulations. In this paper we provide both improved simulations and a new variant of the statistical query model in order to overcome these inefficiencies. \par We improve the time complexity of the classification noise simulation of statistical query algorithms. Our new simulation has a roughly optimal dependence on the noise rate. We also derive a simpler proof that statistical queries can be simulated in the presence of classification noise. This proof makes fewer assumptions on the queries themselves and therefore allows one to simulate more general types of queries. \par We also define a new variant of the statistical query model based on relative error, and we show that this variant is more natural and strictly more powerful than the standard additive error model. We demonstrate efficient PAC simulations for algorithms in this new model and give general upper bounds on both learning with relative error statistical queries and PAC simulation. We show that any statistical query algorithm can be simulated in the PAC model with malicious errors in such a way that the resultant PAC algorithm has a roughly optimal tolerable malicious error rate and sample complexity. \par Finally, we generalize the types of queries allowed in the statistical query model. We discuss the advantages of allowing these generalized queries and show that our results on improved simulations also hold for these queries. %X tr-17-94.ps.gz %R TR-18-94 %D 1994 %T Finite Element Techniques for Computational Fluid Dynamics on the Connection Machine CM-5 System %A Z. Johan %A K.K. Mathur %A S.L. Johnsson %A T.J.R. Hughes %X tr-18-94.ps.gz %R TR-19-94 %D 1994 %T Scientific Software Libraries for Scalable Architectures %A S. Lennart Johnsson %A Kapil K. 
Mathur %X Massively parallel processors introduce new demands on software systems with respect to performance, scalability, robustness and portability. The increased complexity of the memory systems and the increased range of problem sizes for which a given piece of software is used pose serious challenges to software developers. The Connection Machine Scientific Software Library, CMSSL, uses several novel techniques to meet these challenges. The CMSSL contains routines for managing the data distribution and provides data distribution independent functionality. High performance is achieved through careful scheduling of arithmetic operations and data motion, and through the automatic selection of algorithms at run--time. We discuss some of the techniques used, and provide evidence that CMSSL has reached the goals of performance and scalability for an important set of applications. %X tr-19-94.ps.gz %R TR-20-94 %D 1994 %T Load-Balanced LU and QR Factor and Solve Routines for Scalable Processors with Scalable I/O %A Jean-Philippe Brunet %A Palle Pedersen %A S. Lennart Johnsson %X The concept of block--cyclic order elimination can be applied to out--of--core $LU$ and $QR$ matrix factorizations on distributed memory architectures equipped with a parallel I/O system. This elimination scheme provides load balanced computation in both the factor and solve phases and further optimizes the use of the network bandwidth to perform I/O operations. Stability of LU factorization is enforced by full column pivoting. Performance results are presented for the Connection Machine system CM--5. %X tr-20-94.ps.gz %R TR-21-94 %D 1994 %T ROMM Routing: A Class of Efficient Minimal Routing Algorithms %A Ted Nesson %A Lennart Johnsson %X ROMM is a class of Randomized, Oblivious, Multi--phase, Minimal routing algorithms. ROMM routing offers the potential for improved performance compared to fully randomized algorithms under both light and heavy loads.
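In a two-phase randomized minimal scheme of this kind on a binary $n$-cube, a packet is routed minimally to a random intermediate node chosen inside the subcube spanned by source and destination, and then minimally to the destination, so the overall route remains a shortest path. A sketch of the intermediate-node choice (the details here are illustrative, not the paper's exact algorithm):

```python
import random

def romm_intermediate(src, dst, rng=random):
    # Pick a random node inside the minimal subcube spanned by src and dst
    # on a binary hypercube: flip, independently, a random subset of the
    # bits in which src and dst differ. Routing src -> mid -> dst minimally
    # then never leaves a shortest src-dst path.
    differ = src ^ dst
    mask = 0
    bit = 1
    while bit <= differ:
        if (bit & differ) and rng.random() < 0.5:
            mask |= bit
        bit <<= 1
    return src ^ mask

d = lambda a, b: bin(a ^ b).count('1')   # Hamming distance between nodes
mid = romm_intermediate(0b10110, 0b01100)
# The two-phase distance equals the one-phase (minimal) distance:
assert d(0b10110, mid) + d(mid, 0b01100) == d(0b10110, 0b01100)
```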
ROMM routing also offers close to best case performance for many common permutations. These claims are supported by extensive simulations of binary cube networks for a number of routing patterns. We show that $k\times n$ buffers per node suffice to make $k$--phase ROMM routing free from deadlock and livelock on $n$--dimensional binary cubes. %X tr-21-94.ps.gz %R TR-22-94 %D 1994 %T Issues in High Performance Computer Networks %A S. Lennart Johnsson %X tr-22-94.ps.gz %R TR-23-94 %D 1994 %T A Comparative Study of Search and Optimization Algorithms for the Automatic Control of Physically Realistic 2-D Animated Figures %A Alex Fukunaga %A Jon Christensen %A J. Thomas Ngo %A Joe Marks %X In the Spacetime Constraints paradigm of animation, the animator specifies what a character should do, and the details of the motion are generated automatically by the computer. Ngo and Marks recently proposed a technique of automatic motion synthesis that uses a massively parallel genetic algorithm to search a space of motion controllers that generate physically realistic motions for 2D articulated figures. In this paper, we describe an empirical study of evolutionary computation algorithms and standard function optimization algorithms that were implemented in lieu of the massively parallel GA in order to find a substantially more efficient search algorithm that would be viable on serial workstations. We discovered that simple search algorithms based on the evolutionary programming paradigm were most efficient in searching the space of motion controllers. %X tr-23-94.ps.gz %R TR-24-94 %D 1994 %T Implementing O(N) N-body Algorithms Efficiently in Data Parallel Languages (High Performance Fortran) %A Yu Hu %A S. Lennart Johnsson %X O(N) algorithms for N-body simulations enable the simulation of particle systems with up to 100 million particles on current Massively Parallel Processors (MPPs). 
Our optimization techniques mainly focus on minimizing data movement through careful management of the data distribution and the data references, both between the memories of different nodes, and within the memory hierarchy of each node. We show how the techniques can be expressed in languages with an array syntax, such as Connection Machine Fortran (CMF). All CMF constructs used, with one exception, are included in High Performance Fortran. \par The effectiveness of our techniques is demonstrated on an implementation of Anderson's hierarchical O(N) N-body method for the Connection Machine system CM-5/5E. Communication accounts for about 10-20\% of the total execution time, with the average efficiency for arithmetic operations being about 40\% and the total efficiency (including communication) being about 35\%. For the CM-5E a performance in excess of 60 Mflop/s per node (peak 160 Mflop/s per node) has been measured. %X tr-24-94.ps.gz %R TR-25-94 %D 1994 %T Using Collaborative Plans to Model the Intentional Structure of Discourse %A Karen E. Lochbaum %X An agent's ability to understand an utterance depends upon its ability to relate that utterance to the preceding discourse. The agent must determine whether the utterance begins a new segment of the discourse, completes the current segment, or contributes to it. The intentional structure of the discourse, comprising discourse segment purposes and their interrelationships, plays a central role in this process (Grosz and Sidner, 1986). In this thesis, we provide a computational model for recognizing intentional structure and utilizing it in discourse processing. The model specifies how an agent's beliefs about the intentions underlying a discourse affect and are affected by its subsequent discourse. We characterize this process for both interpretation and generation and then provide specific algorithms for modeling the interpretation process.
\par The collaborative planning framework of SharedPlans (Lochbaum, Grosz, and Sidner, 1990; Grosz and Kraus, 1993) provides the basis for our model of intentional structure. Under this model, agents are taken to engage in discourses and segments of discourses for reasons that derive from the mental state requirements of action and collaboration. Each utterance of a discourse is understood in terms of its contribution to the SharedPlans in which the discourse participants are engaged. We demonstrate that this model satisfies the requirements of Grosz and Sidner's (1986) theory of discourse structure and also simplifies and extends previous plan-based approaches to dialogue understanding. The model has been implemented in a system that demonstrates the contextual role of intentional structure in both interpretation and generation. %X tr-25-94.ps.gz %R TR-26-94 %D 1994 %T A Data Parallel Implementation of Hierarchical N-body Methods %A Yu Hu %A S. Lennart Johnsson %X The O(N) hierarchical N-body algorithms and Massively Parallel Processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We describe a data parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10-25\%, and the overall efficiency is about 35\%. On a CM-5E the overall performance is about 60 Mflop/s per node, independent of the number of nodes. %X tr-26-94.ps.gz %R TR-27-94 %D 1994 %T Local Basic Linear Algebra Subroutines (LBLAS) for the CM--5/5E %A David Kramer %A S. Lennart Johnsson %A Yu Hu %X The Connection Machine Scientific Software Library (CMSSL) is a library of scientific routines designed for distributed memory architectures. The BLAS of the CMSSL have been implemented as a two--level structure to exploit optimizations local to nodes and across nodes. 
This paper presents the implementation considerations and performance of the Local BLAS, or BLAS local to each node of the system. A wide variety of loop structures and unrollings have been implemented in order to achieve a uniform and high performance, irrespective of the data layout in node memory. The CMSSL is the only existing high--performance library capable of supporting both the data parallel and message passing modes of programming a distributed memory computer. The implications of implementing BLAS on distributed memory computers are considered in this light. %X tr-27-94.ps.gz %R TR-28-94 %D 1994 %T Infrastructure for Research towards Ubiquitous Information Systems %A Barbara Grosz %A H.T. Kung %A Margo Seltzer %A Stuart Shieber %A Michael Smith %X tr-28-94.ps.gz %R TR-29-94 %D 1994 %T Automatic Derivation of Parallel and Systolic Programs %A Lilei Chen %X We present a simple method for developing parallel and systolic programs from data dependence. We derive sequences of parallel computations and communications based on data dependence and communication delays, and minimize the communication delays and processor idle time. The potential applications for this method include supercompiling, automatic development of parallel programs, and systolic array design. %X tr-29-94.ps.gz %R TR-30-94 %D 1994 %T VINO: An Integrated Platform for Operating System and Database Research %A Christopher Small %A Margo Seltzer %X In 1981, Stonebraker wrote: \par Operating system services in many existing systems are either too slow or inappropriate. Current DBMSs usually provide their own and make little or no use of those offered by the operating system. \par The standard operating system model has changed little since that time, and we believe that, at its core, it is the {\em wrong} model for DBMS and other resource-intensive applications. The standard model is inflexible, uncooperative, and irregular in its treatment of resources. 
\par We describe the design of a new system, the VINO kernel, which addresses the limitations of standard operating systems. It focuses on three key ideas: \par - Applications direct policy. - Kernel mechanisms are reusable by applications. - All resources share a common extensible interface. \par VINO's power and flexibility make it an ideal platform for the design and implementation of traditional and modern database management systems. %X tr-30-94.ps.gz %R TR-31-94 %D 1994 %T Abstract Execution in a Multi-Tasking Environment %A David Mazi\'{e}res %A Michael D. Smith %X Tracing software execution is an important part of understanding system performance. Raw CPU power has been increasing at a rate far greater than memory and I/O bandwidth, with the result that the performance of client/server and I/O-bound applications is not scaling as one might hope. Unfortunately, the behavior of these types of applications is particularly sensitive to the kinds of distortion induced by traditional tracing methods, so that current traces are either incomplete or of questionable accuracy. Abstract execution is a powerful tracing technique which was invented to speed the tracing of single processes and to store trace data more compactly. In this work, abstract execution was extended to trace multi-tasking workloads. The resulting system is more than 5 times faster than other current methods of gathering multi-tasking traces, and can therefore generate traces with far less time distortion. %X tr-31-94.ps.gz %R TR-32-94 %D 1994 %T Rationality %A L.G. Valiant %X tr-32-94.ps.gz %R TR-33-94 %D 1994 %T Derivatives of the Matrix Exponential and their Computation %A Igor Najfeld %A Timothy F. Havel %X Matrix exponentials and their derivatives play an important role in the perturbation analysis, control and parameter estimation of linear dynamical systems. 
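The directional derivative discussed in this entry can be checked numerically: by a classical identity (Van Loan, 1978), equivalent to the integral representation, the derivative of $\exp({\bf A})$ in direction ${\bf V}$ appears as the upper-right block of the exponential of the block-triangular matrix $\left[\begin{smallmatrix}{\bf A}&{\bf V}\\0&{\bf A}\end{smallmatrix}\right]$. A pure-Python sketch comparing it with a central finite difference (matrix sizes and entries are illustrative):

```python
# Numerical check: the upper-right block of expm([[A, V], [0, A]]) equals
# the directional derivative d/dh expm(A + h*V) at h = 0 (Van Loan's block
# identity, equivalent to the integral representation of the derivative).

def matmul(X, Y):
    n, m, p = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def expm(X, terms=40):
    # Truncated Taylor series; adequate for the small, mild matrices here.
    n = len(X)
    S = [[float(i == j) for j in range(n)] for i in range(n)]   # identity
    T = [row[:] for row in S]
    for k in range(1, terms):
        T = matmul(T, X)                       # T becomes X^k / k!
        T = [[t / k for t in row] for row in T]
        S = [[S[i][j] + T[i][j] for j in range(n)] for i in range(n)]
    return S

A = [[0.2, 0.5], [-0.3, 0.1]]
V = [[0.0, 1.0], [1.0, 0.0]]

# Block matrix [[A, V], [0, A]]; the upper-right 2x2 block of its
# exponential is the directional derivative D expm(A)[V].
B = [A[0] + V[0], A[1] + V[1], [0, 0] + A[0], [0, 0] + A[1]]
D = [row[2:] for row in expm(B)[:2]]

# Central finite difference for comparison.
h = 1e-5
P = expm([[A[i][j] + h * V[i][j] for j in range(2)] for i in range(2)])
M = expm([[A[i][j] - h * V[i][j] for j in range(2)] for i in range(2)])
F = [[(P[i][j] - M[i][j]) / (2 * h) for j in range(2)] for i in range(2)]
assert all(abs(D[i][j] - F[i][j]) < 1e-8 for i in range(2) for j in range(2))
```

The block form costs one exponential of a doubled matrix per direction, which is one reason dedicated derivative algorithms such as those in this report are of interest.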
The well-known integral representation of the derivative of the matrix exponential $\exp(t{\bf A})$ in the matrix direction ${\bf V}$, $\int_0^t \exp((t-\tau){\bf A})\,{\bf V}\exp(\tau{\bf A})\,{\rm d}\tau$, enables us to derive a number of new properties of this derivative, along with spectral, series and exact representations. Many of these results extend to arbitrary analytic functions of a matrix argument, for which we have also derived a simple relation between the gradients of their entries and the directional derivatives in the elementary directions. Based on these results, we construct and optimize two new algorithms for computing the directional derivative. We have also developed a new algorithm for computing the matrix exponential, based on a rational representation of the exponential in terms of the hyperbolic function ${\bf A}\coth({\bf A})$, which is more efficient than direct Pad\'{e} approximation. Finally, these results are illustrated by an application to a biologically important parameter estimation problem which arises in nuclear magnetic resonance spectroscopy. %X tr-33-94.ps.gz %R TR-34-94 %D 1994 %T VINO: The 1994 Fall Harvest %A Yasuhiro Endo %A James Gwertzman %A Margo Seltzer %A Christopher Small %A Keith A. Smith %A Diane Tang %X tr-34-94.ps.gz %R TR-35-94 %D 1994 %T File Layout and File System Performance %A Keith Smith %A Margo Seltzer %X Most contemporary implementations of the Berkeley Fast File System optimize file system throughput by allocating logically sequential data to physically contiguous disk blocks. This clustering is effective when there are many contiguous free blocks on the file system. But the repeated creation and deletion of files of varying sizes that occurs over time on active file systems is likely to cause fragmentation of free space, limiting the ability of the file system to allocate data contiguously and therefore degrading performance.
\par This paper presents empirical data and an analysis of allocation and fragmentation in the SunOS 4.1.3 file system (a derivative of the Berkeley Fast File System). We have collected data from forty-eight file systems on four file servers over a period of ten months. Our data show that small files are more fragmented than large files, with fewer than 35\% of the blocks in two-block files being allocated optimally, but more than 80\% of the blocks in files larger than 256 kilobytes being allocated optimally. Two factors are responsible for this difference in fragmentation: an uneven distribution of free space within file system cylinder groups and a disk allocation algorithm which frequently allocates the last block of a file discontiguously from the rest of the file. \par Performance measurements on replicas of active file systems show that they seldom perform as well as comparable empty file systems, but that this performance degradation is rarely more than 10-15\%. This decline in performance is directly correlated to the amount of fragmentation in the files used by the benchmark programs. Both file system utilization and the amount of fragmentation in existing files on the file system influence the amount of fragmentation in newly created files. Characteristics of the file system workload also have a significant impact on file system fragmentation and performance, with typical news server workloads causing extreme fragmentation. %X tr-35-94.ps.gz %R TR-36-94 %D 1994 %T Bulk Synchronous Parallel Computing -- A Paradigm for Transportable Software %A Thomas Cheatham %A Amr Fahmy %A Dan C. Stefanescu %A Leslie G. Valiant %X A necessary condition for the establishment, on a substantial basis, of a parallel software industry would appear to be the availability of technology for generating transportable software, i.e. architecture independent software which delivers scalable performance for a wide variety of applications on a wide range of multiprocessor computers.
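The BSP cost accounting that underlies such transportability analyses charges each superstep w + g*h + L, where w is the longest local computation, h the largest number of words sent or received by any processor, and L the synchronization period. A minimal sketch (the parameter values below are illustrative, not measured machine constants):

```python
def superstep_cost(w_max, h_max, g, L):
    # BSP cost of one superstep: longest local computation, plus the
    # h-relation charged at g time units per word, plus the barrier cost L.
    return w_max + g * h_max + L

def program_cost(supersteps, g, L):
    # Total cost: sum over supersteps given as (w_max, h_max) pairs.
    return sum(superstep_cost(w, h, g, L) for w, h in supersteps)

# The same algorithm costed on two hypothetical machines: the
# communication-heavy second superstep dominates when g is large.
steps = [(1000, 10), (200, 500)]
print(program_cost(steps, g=2, L=100))    # 2420
print(program_cost(steps, g=10, L=50))    # 6400
```

Comparing such totals across (g, L) pairs is how the transportability of an algorithm class is quantified in this framework.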
This paper describes H-BSP -- a general purpose parallel computing environment for developing transportable algorithms. H-BSP is based on the Bulk Synchronous Parallel Model (BSP), in which a computation involves a number of supersteps, each having several parallel computational threads that synchronize at the end of the superstep. The BSP Model deals explicitly with the notion of communication among computational threads and introduces parameters g and L that quantify the ratio of communication throughput to computation throughput, and the synchronization period, respectively. These two parameters, together with the number of processors and the problem size, are used to quantify the performance and, therefore, the transportability of given classes of algorithms across machines having different values for these parameters. This paper describes the role of unbundled compiler technology in facilitating the development of such a parallel computer environment. %X tr-36-94.ps.gz %R TR-01-95 %D 1995 %T Bayesian Grammar Induction for Language Modeling %A Stanley F. Chen %X We describe a corpus-based induction algorithm for probabilistic context-free grammars. The algorithm employs a greedy heuristic search within a Bayesian framework, and a post-pass using the Inside-Outside algorithm. We compare the performance of our algorithm to n-gram models and the Inside-Outside algorithm in three language modeling tasks. In two of these domains, our algorithm outperforms these other techniques, marking the first time a grammar-based language model has surpassed n-gram modeling in a task of at least moderate size. %X tr-01-95.ps.gz %R TR-02-95 %D 1995 %T Learning in Order to Reason %A Dan Roth %X Any theory aimed at understanding {\em commonsense} reasoning, the process that humans use to cope with the mundane but complex aspects of the world in evaluating everyday situations, should account for its flexibility, its adaptability, and the speed with which it is performed.
\par In this thesis we analyze current theories of reasoning and argue that they do not satisfy those requirements. We then proceed to develop a new framework for the study of reasoning, in which a learning component has a principal role. We show that our framework efficiently supports considerably ``more reasoning'' than traditional approaches and at the same time matches our expectations of plausible patterns of reasoning in cases where other theories do not. \par In the first part of this thesis we present a computational study of the knowledge-based system approach, the generally accepted framework for reasoning in intelligent systems. We present a comprehensive study of several methods used in approximate reasoning as well as some reasoning techniques that use approximations in an effort to avoid computational difficulties. We show that these are even harder computationally than exact reasoning tasks. What is more surprising is that, as we show, even the approximate versions of these approximate reasoning tasks are intractable, and these severe hardness results on approximate reasoning hold even for very restricted knowledge representations. \par Motivated by these computational considerations we argue that a central question to consider, if we want to develop computational models for commonsense reasoning, is how the intelligent system acquires its knowledge and how this process of interaction with its environment influences the performance of the reasoning system. The {\em Learning to Reason} framework developed and studied in the rest of the thesis exhibits the role of inductive learning in achieving efficient reasoning, and the importance of studying reasoning and learning phenomena together. The framework is defined in a way that is intended to overcome the main computational difficulties in the traditional treatment of reasoning, and indeed, we exhibit several positive results that do not hold in the traditional setting.
We develop Learning to Reason algorithms for classes of theories for which no efficient reasoning algorithm exists when represented as a traditional (formula-based) knowledge base. We also exhibit Learning to Reason algorithms for a class of theories that is not known to be learnable in the traditional sense. Many of our results rely on the theory of model-based representations that we develop in this thesis. In this representation, the knowledge base is represented as a set of models (satisfying assignments) rather than a logical formula. We show that in many cases reasoning with a model-based representation is more efficient than reasoning with a formula-based representation and, more significantly, that it suggests a new view of reasoning, and in particular, of logical reasoning. \par In the final part of this thesis, we address another fundamental criticism of the knowledge-based system approach. We suggest a new approach for the study of the non-monotonicity of human commonsense reasoning, within the Learning to Reason framework. The theory developed is shown to support efficient reasoning with incomplete information, and to avoid many of the representational problems which existing default reasoning formalisms face. \par We show how the various reasoning tasks we discuss in this thesis relate to each other and conclude that they are all supported together naturally. %X tr-02-95.ps.gz %R TR-03-95 %D 1995 %T Translating between Horn Representations and their Characteristic Models %A Roni Khardon %X Characteristic models are an alternative, model based, representation for Horn expressions. It has been shown that these two representations are incomparable and each has its advantages over the other. It is therefore natural to ask what is the cost of translating, back and forth, between these representations. Interestingly, the same translation questions arise in database theory, where it has applications to the design of relational databases. 
\par We study the complexity of these problems and prove some positive and negative results. Our main result is that the two translation problems are equivalent under polynomial reductions, and that they are equivalent to the corresponding decision problem. Namely, translating is equivalent to deciding whether a given set of models is the set of characteristic models for a given Horn expression. \par We also relate these problems to translating between the CNF and DNF representations of monotone functions, a well known problem for which no polynomial time algorithm is known. It is shown that in general our translation problems are at least as hard as the latter, and in a special case they are equivalent to it. %X tr-03-95.ps.gz %R TR-04-95 %D 1995 %T Volume of a Hyper-Parallelepiped after Affine Transformations, and its Application to Optimal Parallel Loop Execution %A Yan-Zhong Ding %A Dan Stefanescu %X This paper presents a theoretical framework for the efficient scheduling of a class of parallel loop nests on distributed memory parallel computers. The method generates two classes of schedules, evaluates them according to a full-fledged cost model and then selects the best option. The cost model used is the Bulk Synchronous Parallel model. The method can generate schedules whose efficiency is tailored to any parallel architecture and any parameters characterizing the parallel loops. As an application, we generate optimal schedules for the matrix-matrix multiplication problem for general matrices, thus extending previous results for square matrices. This is an example of a compiler optimization for transportable parallel software. %R TR-05-95 %D 1995 %T A Proposed New Memory Manager %A Robert L. Walton %X Memory managers should support compactification, multiple simultaneous garbage collections, and ephemeral collections in a realtime multi-processor shared memory environment.
They should permit old addresses of an object to be invalidated without significant delay, and should permit array accesses with no per-element inefficiency. \par A new approach to building an optimal standard solution to these requirements is presented for stock hardware and next generation languages. If such an approach should become a standard, this would spur the development of standard hardware to optimize away the overhead. %X tr-05-95.ps.gz %R TR-06-95 %D 1995 %T A Comparative Analysis of Schemes for Correlated Branch Prediction %A Cliff Young %A Nicolas Gloy %A Michael D. Smith %X Modern high-performance architectures require extremely accurate branch prediction to overcome the performance limitations of conditional branches. We present a framework that categorizes branch prediction schemes by the way in which they partition dynamic branches and by the kind of predictor that they use. The framework allows us to compare and contrast branch prediction schemes, and to analyze why they work. We use the framework to show how a static correlated branch prediction scheme increases branch bias and thus improves overall branch prediction accuracy. We also use the framework to identify the fundamental differences between static and dynamic correlated branch prediction schemes. This study shows that there is room to improve the prediction accuracy of existing branch prediction schemes. %X tr-06-95.ps.gz %R TR-07-95 %D 1995 %T Efficient Learning of Real Time One-Counter Automata %A Amr Fahmy %A Robert Roos %X We present an efficient learning algorithm for languages accepted by deterministic real time one counter automata (ROCA). The learning algorithm works by first learning an initial segment, $B_n$, of the infinite state machine that accepts the unknown language and then decomposing it into a complete control structure and a partial counter. A new efficient ROCA decomposition algorithm, which will be presented in detail, allows this result.
The decomposition algorithm works in $O(n^2 \log(n))$ time, where $n$ is the number of states of $B_n$. \par If Angluin's algorithm for learning regular languages is used to learn $B_n$ and the complexity of this step is $h(n,m)$, where $m$ is the length of the longest counterexample necessary for Angluin's algorithm, the complexity of our algorithm is thus $O(h(n,m) + n^2 \log(n))$. %X tr-07-95.ps.gz %R TR-08-95 %D 1995 %T ROMM Routing on Mesh and Torus Networks %A Ted Nesson %A S. Lennart Johnsson %X ROMM is a class of Randomized, Oblivious, Multi--phase, Minimal routing algorithms. ROMM routing offers a potential for improved performance compared to both fully randomized algorithms and deterministic oblivious algorithms, under both light and heavy loads. ROMM routing also offers close to best case performance for many common routing problems. In previous work, these claims were supported by extensive simulations on binary cube networks. Here we present analytical and empirical results for ROMM routing on wormhole routed mesh and torus networks. Our simulations show that ROMM algorithms can perform several representative routing tasks 1.5 to 3 times faster than fully randomized algorithms, for medium--sized networks. Furthermore, ROMM algorithms are always competitive with deterministic, oblivious routing, and in some cases, up to 2 times faster. %X tr-08-95.ps.gz %R TR-09-95 %D 1995 %T The Impact of Operating System Structure on Personal Computer Performance %A J. Bradley Chen %A Yasuhiro Endo %A Kee Chan %A David Mazieres %A Antonio Dias %A Margo Seltzer %A Michael Smith %X This paper presents a comparative study of the performance of three operating systems that run on the personal computer architecture derived from the IBM-PC.
The operating systems, Windows for Workgroups (tm), Windows NT (tm), and NetBSD (a freely available UNIX (tm) variant) cover a broad range of system functionality and user requirements, from a single address space model to full protection with preemptive multitasking. Our measurements were enabled by hardware counters in Intel's Pentium (tm) processor that permit measurement of a broad range of processor events including instruction counts and on-chip cache miss rates. We used both microbenchmarks, which expose specific differences between the systems, and application workloads, which provide an indication of expected end-to-end performance. Our microbenchmark results show that accessing system functionality is more expensive in Windows than in the other two systems due to frequent changes in machine mode and the use of system call hooks. When running native applications, Windows NT is more efficient than Windows, but it does incur overhead from its microkernel structure. Overall, system functionality can be accessed most efficiently in NetBSD; we attribute this to its monolithic structure, and to the absence of the complications created by backwards compatibility in the other systems. Measurements of application performance show that the impact of these differences is significant in terms of overall execution time. %X tr-09-95.ps.gz %R TR-10-95 %D 1995 %T Managing Design Complexity: Using Stochastic Optimization in the Production of Computer Graphics %A Jon Christensen %X This thesis examines the automated design of computer graphics. We present a methodology that emphasizes optimization, problem representation, stochastic search, and empirical analysis. Two problems are considered, which together encompass and exemplify both 2D and 3D graphics production: label placement and motion synthesis for animation.
\par Label placement is the problem of annotating various informational graphics with textual labels, subject to constraints that respect proper label-feature associativity, label and feature obscuration, and aesthetically desirable label positions. Examples of label placement include applying textual labels to a geographic map, or item tags to a scatterplot. Motion synthesis is the problem of composing a visually plausible motion for an animated character, subject to animator-imposed constraints on the form and characteristics of the desired motion. \par For each problem we propose new solution methods that utilize efficient problem representations combined with stochastic optimization techniques. We demonstrate that these methods offer significant advantages over competing solutions in terms of ease-of-use, visual quality, and computational efficiency. Taken together, these results also demonstrate an effective approach for continued progress in automating graphical design, which should be applicable to a wide range of graphical design applications beyond the two considered here. %X tr-10-95.ps.gz %X tr-10-95-p70.ps.gz %X tr-10-95-p71.ps.gz %R TR-11-95 %D 1995 %T Interpreting Cohesive Forms in the Context of Discourse Inference %A Andrew Kehler %X In this thesis, we present analyses and algorithms for resolving a variety of cohesive phenomena in natural language, including VP-ellipsis, gapping, event reference, tense, and pronominal reference. Past work has attempted to explain the complicated behavior of these expressions with theories that operate within a single module of language processing. We argue that such approaches cannot be maintained; in particular, the data we present strongly suggest that the nature of the coherence relation operative between clauses needs to be taken into account. \par We provide a theory of coherence relations and the discourse inference processes that underlie their recognition.
We utilize this theory to break the deadlock between syntactic and semantic approaches to resolving VP-ellipsis. We show that the data exhibits a pattern with respect to our categorization of coherence relations, and present an account which predicts this pattern. We extend our analysis to gapping and event reference, and show that our analyses result in a more independently-motivated and empirically-adequate distinction among types of anaphoric processes than past analyses. \par We also present an account of VP-ellipsis resolution that predicts the correct set of `strict' and `sloppy' readings for a number of benchmark examples that are problematic for past approaches. The correct readings can be seen to result from a general distinction between `referring' and `copying' in anaphoric processes. The account also extends to other types of reference, such as event reference and `one'-anaphora. \par Finally, we utilize our theory of coherence in analyses that break the deadlock between definite-reference and coherence-based approaches to tense and pronoun interpretation. We present a theory of tense interpretation that interacts with discourse inference processes to predict data that is problematic for both types of approach. We demonstrate that the data commonly cited in the pronoun interpretation literature also exhibits a pattern with respect to coherence relations, and make some preliminary proposals for how such a pattern might result from the properties of the different types of discourse inference we posit. %X tr-11-95.ps.gz %R TR-12-95 %D 1995 %T Containment Algorithms for Nonconvex Polygons with Applications to Layout %A Karen McIntosh Daniels %X Layout and packing are NP-hard geometric optimization problems which appear in a variety of manufacturing industries. At their core, layout and packing problems have the common geometric feasibility problem of {\em containment}: find a way of placing a set of items into a container. 
We focus on containment and its applications to layout and packing problems. We demonstrate that, although containment is NP-hard, it is fruitful to: 1) develop algorithms for containment, as opposed to heuristics, 2) design containment algorithms so that they say ``no'' almost as fast as they say ``yes'', 3) use geometric techniques, not just mathematical programming techniques, and 4) maximize the number of items for which the algorithms are practical. \par Our approach to containment is based on a new restrict/evaluate/subdivide paradigm. We develop theory and practical techniques for the operations within the paradigm. The techniques are appropriate for two-dimensional containment problems in which the items and container may be irregular polygons, and in which the items may be translated, but not rotated. Our techniques can be combined to form a variety of two-dimensional translational containment algorithms. The paradigm is designed so that, unlike existing iteration-based algorithms, containment algorithms based on the paradigm are adept at saying ``no'', even for slightly infeasible problems. We present two algorithms based on our paradigm. We obtain the first practical running times for NP-complete two-dimensional translational containment problems for up to ten nonconvex items in a nonconvex container. \par We demonstrate that viewing containment as a feasibility problem has many benefits for packing and layout problems. For example, we present an effective method for finding minimal enclosures which uses containment to perform binary search on a parameter. Compaction techniques can accelerate the search. We also use containment to develop the first practical pre-packing strategy for a multi-stage pattern layout problem in apparel manufacturing. Pre-packing is a layout method which packs items into a collection of containers by first generating groups of items which fit into each container and then assigning groups to containers. 
%X tr-12-95.ps.gz %R TR-13-95 %D 1995 %T Probabilistic Cache Replacement %A J. Bradley Chen %X Modern microprocessors tend to use on-chip caches that are much smaller than the working set size of many interesting computations. In such situations, cache performance can be improved through selective caching, the use of cache replacement policies where data fetched from memory, although forwarded to the CPU, is not necessarily loaded into the cache. This paper introduces a selective caching policy called Probabilistic Cache Replacement (PCR) in which caching of data fetched from main memory is determined by a probabilistic boolean-valued function. Use of PCR creates a self-selection mechanism in which repeated misses to a word in memory increase its probability of being loaded into the cache. A PCR cache gives better reductions in instruction cache miss rate than a comparable cache configuration with a victim-cache. Instruction cache miss rates can be reduced by up to 30% for some of the SPECmarks, although the optimal probability distribution is workload dependent. This paper also presents a mechanism called Feedback PCR which dynamically selects probability values for a PCR cache. For a 16 K byte direct-mapped instruction cache, Feedback PCR with a one-entry MFB gives an average reduction in cache misses of over 11% across the SPECmarks with no significant increase in cache misses for any of the workloads, and compares favorably with other alternatives of similar hardware cost. %X tr-13-95.ps.gz %R TR-14-95 %D 1995 %T MOSS: A Mobile Operating Systems Substrate %A J. Bradley Chen %A H.T. Kung %A Margo Seltzer %X The Mobile Operating System Substrate (MOSS) is a new system architecture for wireless mobile computing being developed at Harvard. MOSS provides highly efficient, robust and flexible virtual device access over wireless media.
MOSS services provide mobile access to such resources as disks, CD ROM drives, displays, wired network interfaces, and audio and video devices. MOSS services are composed of virtual circuits and virtual devices. Virtual circuits (VCs) on wireless media support the spectrum of quality-of-service (QoS) levels required to cover a broad range of application requirements. Virtual devices implement resource access using VCs as their communication substrate. The tight coupling of network code and device implementations makes it possible to apply device-specific semantics to communications resource management problems. MOSS will enable mobile software systems to adapt dynamically to the rapidly changing computing and communications environment created by mobility. %X tr-14-95.ps.gz %R TR-15-95 %D 1995 %T Reasoning with Examples: Propositional Formulae and Database Dependencies %A Roni Khardon %A Heikki Mannila %A Dan Roth %X For humans, looking at how concrete examples behave is an intuitive way of deriving conclusions. The drawback with this method is that it does not necessarily give the correct results. However, under certain conditions example-based deduction can be used to obtain a correct and complete inference procedure. This is the case for Boolean formulae (reasoning with models) and for certain types of database integrity constraints (the use of Armstrong relations). We show that these approaches are closely related, and use the relationship to prove new results about the existence and sizes of Armstrong relations for Boolean dependencies. Further, we study the problem of translating between different representations of relational databases, in particular we consider Armstrong relations and Boolean dependencies, and prove some positive results in that context. Finally, we discuss the close relations between the questions of finding keys in relational databases and that of finding abductive explanations. 
%X tr-15-95.ps.gz %R TR-16-95 %D 1995 %T The Case for Extensible Operating Systems %A Margo Seltzer %A Keith Smith %A Christopher Small %X Many of the performance improvements cited in recent operating systems research describe specific enhancements to normal operating system functionality that improve performance in a set of designated test cases. Global changes of this sort can improve performance for one application, at the cost of decreasing performance for others. We argue that this flurry of global kernel tweaking is an indication that our current operating system model is inappropriate. Existing interfaces do not provide the flexibility to tune the kernel on a per-application basis, to suit the variety of applications that we now see. \par We have failed in the past to be omniscient about future operating system requirements; there is no reason to believe that we will fare any better designing a new, fixed kernel interface today. Instead, the only general-purpose solution is to build an operating system interface that is easily extendable. We present a kernel framework designed to support the application-specific customization that is beginning to dominate the operating system literature. We show how this model enables easy implementation of many of the earlier research results. We then analyze two specific kernel policies: page read-ahead and lock-granting. We show that application-control over read-ahead policy produces performance improvements of up to 16\%. We then show how application-control over the lock-granting policy can choose between fairness and response time. Reader priority algorithms produce lower read response time at the cost of writer starvation. FIFO algorithms avoid the starvation problem, but increase read response time. 
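The lock-granting trade-off described in TR-16-95 above (reader priority lowers read response time but can starve writers; FIFO avoids starvation but raises read response time) can be illustrated with a toy sketch. This is our own illustration, not code from the report; names and the one-at-a-time granting model are simplifying assumptions.

```python
def grant_order(waiting, policy):
    """Order in which queued lock requests are granted once the current
    holder releases. `waiting` lists 'R' (reader) / 'W' (writer) requests
    in arrival order. Toy model: one grant at a time, no new arrivals."""
    if policy == "fifo":
        # FIFO: strict arrival order; the writer is never bypassed.
        return list(waiting)
    # Reader priority: every waiting reader is granted before any writer,
    # which is what lets a steady stream of readers starve the writer.
    readers = [x for x in waiting if x == "R"]
    writers = [x for x in waiting if x == "W"]
    return readers + writers

queue = ["W", "R", "R", "R"]
print(grant_order(queue, "fifo"))             # ['W', 'R', 'R', 'R']
print(grant_order(queue, "reader_priority"))  # ['R', 'R', 'R', 'W']
```

Under reader priority the writer, although it arrived first, is served last; under FIFO the readers queued behind it all wait for its critical section, which is the read-response-time cost the abstract describes.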
%X tr-16-95.ps.gz %R TR-17-95 %D 1995 %T Autonomous Replication in Wide-Area Internetworks %A James Gwertzman %X The number of users connected to the Internet has been growing at an exponential rate, resulting in similar increases in network traffic and Internet server load. Advances in microprocessors and network technologies have kept up with growth so far, but we are reaching the limits of hardware solutions. In order for the Internet's growth to continue, we must efficiently distribute server load and reduce the network traffic generated by its various services. \par Traditional wide-area caching schemes are client initiated. Decisions on where and when to cache information are made without the benefit of the server's global knowledge of the situation. We introduce a technique, push-caching, that is server initiated; it leaves caching decisions to the server. The server uses its knowledge of network topology, geography, and access patterns to minimize network traffic and server load. \par The World Wide Web is an example of a large-scale distributed information system that will benefit from this geographical distribution, and we present an architecture that allows a Web server to autonomously replicate Web files. We use a trace-driven simulation of the Internet to evaluate several competing caching strategies. Our results show that while simple client caching reduces server load and network bandwidth demands by up to 30\%, adding server-initiated caching reduces server load by an additional 20\% and network bandwidth demands by an additional 10\%. Furthermore, push-caching is more efficient than client-caching, using an order of magnitude less cache space for comparable bandwidth and load savings. 
\par To determine the optimal cache consistency protocol we used a generic server simulator to evaluate several cache-consistency protocols, and found that weak consistency protocols are sufficient for the World Wide Web since they use the same bandwidth as an atomic protocol, impose less server load, and return stale data less than 1\% of the time. %X tr-17-95.ps.gz %R TR-18-95 %D 1995 %T Centering: A Framework for Modelling the Local Coherence of Discourse %A Barbara J. Grosz %A Aravind K. Joshi %A Scott Weinstein %X tr-18-95.ps.gz %R TR-19-95 %D 1995 %T Benchmarking Filesystems %A Diane L. Tang %X One of the most widely researched areas in operating systems is filesystem design, implementation, and performance. Almost all of the research involves reporting performance numbers gathered from a variety of different benchmarks. The problem with such results is that existing filesystem benchmarks are inadequate, suffering from problems ranging from not scaling with advancing technology to not measuring the filesystem. \par A new approach to filesystem benchmarking is presented here. This methodology is designed both to help system designers understand and improve existing systems and to help users decide which filesystem to buy or run. For usability, the benchmark is separated into two parts: a suite of micro-benchmarks, which is actually run on the filesystem, and a workload characterizer. The results from the two separate parts can be combined to predict the performance of the filesystem on the workload. \par The purpose for this separation of functionality is two-fold. First, many system designers would like their filesystem to perform well under diverse workloads: by characterizing the workload independently, the designers can better understand what is required of the filesystem. 
The micro-benchmarks tell the designer what needs to be improved while the workload characterizer tells the designer whether that improvement will affect filesystem performance under that workload. This separation also helps users trying to decide which system to run or buy, who may not be able to run their workload on all systems under consideration, and therefore need this separation. \par The implementation of this methodology does not suffer from many of the problems seen in existing benchmarks: it scales with technology, it is tightly specified, and it helps system designers. This benchmark's only drawbacks are that it does not accurately predict the performance of a filesystem on a workload, thus limiting its applicability: it is useful to system designers, but not for users trying to decide which system to buy. The belief is that the general approach will work, given additional time to manipulate the prediction algorithm. %X tr-19-95.ps.gz %R TR-20-95 %D 1995 %T Collaborative Plans for Complex Group Action %A Barbara J. Grosz %A Sarit Kraus %X The original formulation of SharedPlans was developed to provide a model of collaborative planning in which it was not necessary for one agent to have intentions-to toward an act of a different agent. Unlike other contemporaneous approaches, this formulation provided for two agents to coordinate their activities without introducing any notion of irreducible joint intentions. However, it only treated activities that directly decomposed into single-agent actions, did not address the need for agents to commit to their joint activity, and did not adequately deal with agents having only partial knowledge of the way in which to perform an action. This paper provides a revised and expanded version of SharedPlans that addresses these shortcomings. 
It also reformulates Pollack's definition of individual plans to handle cases in which a single agent has only partial knowledge; this reformulation meshes with the definition of SharedPlans. The new definitions also allow for contracting out certain actions. The formalization that results has the features required by Bratman's account of shared cooperative activity and is more general than alternative accounts. %X tr-20-95.ps.gz %R TR-21-95 %D 1995 %T Instructions for Annotating Discourse %A Christine H. Nakatani %A Barbara J. Grosz %A David D. Ahn %A Julia Hirschberg %X tr-21-95.ps.gz %R TR-22-95 %D 1995 %T Finding the Largest Rectangle in Several Classes of Polygons %A Karen Daniels %A Victor J. Milenkovic %A Dan Roth %X This paper considers the geometric optimization problem of finding the Largest area axis-parallel Rectangle (LR) in an $n$-vertex general polygon. We characterize the LR for general polygons by considering different cases based on the types of contacts between the rectangle and the polygon. A general framework is presented for solving a key subproblem of the LR problem which dominates the running time for a variety of polygon types. This framework permits us to transform an algorithm for orthogonal polygons into an algorithm for nonorthogonal polygons. Using this framework, we obtain the following LR time results: $\Theta(n)$ for $xy$-monotone polygons, ${\rm O}(n \alpha(n))$ for orthogonally convex polygons, (where $\alpha(n)$ is the slowly growing inverse of Ackermann's function), ${\rm O}(n \alpha(n) \log n)$ for horizontally (vertically) convex polygons, ${\rm O}(n \log n)$ for a special type of horizontally convex polygon (whose boundary consists of two $y$-monotone chains on opposite sides of a vertical line), and ${\rm O}(n \log^2 n)$ for general polygons (allowing holes). For all these types of non-orthogonal polygons, we match the running time of the best known algorithms for their orthogonal counterparts. 
A lower bound of time in $\Omega(n \log n)$ is established for finding the LR in both self-intersecting polygons and general polygons with holes. The latter result gives us both a lower bound of $\Omega(n \log n)$ and an upper bound of ${\rm O}(n \log^2 n)$ for general polygons. %X tr-22-95.ps.gz %R TR-23-95 %D 1995 %T Performance Issues in Correlated Branch Prediction Schemes %A Nicolas Gloy %A Michael D. Smith %A Cliff Young %X Accurate static branch prediction is the key to many techniques for exposing, enhancing, and exploiting Instruction Level Parallelism (ILP). The initial work on static correlated branch prediction (SCBP) demonstrated improvements in branch prediction accuracy, but did not address overall performance. In particular, SCBP expands the size of executable programs, which negatively affects the performance of the instruction memory hierarchy. Using the profile information available under SCBP, we can minimize these negative performance effects through the application of code layout and branch alignment techniques. We evaluate the performance effect of SCBP and these profile-driven optimizations on instruction cache misses, branch mispredictions, and branch misfetches for a number of recent processor implementations. We find that SCBP improves performance over (traditional) per-branch static profile prediction. We also find that SCBP improves the performance benefits gained from branch alignment. As expected, SCBP gives larger benefits on machine organizations with high mispredict/misfetch penalties and low cache miss penalties. Finally, we find that the application of profile-driven code layout and branch alignment techniques (without SCBP) can improve the performance of the dynamic correlated branch prediction techniques. 
%X tr-23-95.ps.gz %R TR-24-95 %D 1995 %T Randomized, Oblivious, Minimal Routing Algorithms for Multicomputers %A Ted Nesson %X Efficient data motion has been critical in high performance computing for as long as computers have been in existence. Massively parallel computers use a sparse interconnection network between processing nodes with local memories. Minimizing the potential for high congestion of communication links is an important goal in the design of routing algorithms and interconnection networks in these systems. \par In these distributed--memory architectures, the communication system represents a significant portion of the total system cost, but is nevertheless often a weak link in the system with respect to performance. Efficient interprocessor communication is one of the most important and most challenging problems associated with massively parallel computing. Communication delays can easily represent a large fraction of the total running time, inhibiting high performance computing for a wide range of problems. Efficient use of the communication system is the focus of this thesis. \par The design of the interconnection network and the routing algorithms used to transport data between nodes are critical to overall system performance. The constraints imposed by a sparse interconnection network suggest that preserving locality of reference through careful data allocation and minimizing network load by using minimal algorithms are desirable objectives. \par In this thesis, we present ROMM, a new class of general--purpose message routing algorithms for large--scale, distributed--memory multicomputers. ROMM is a class of Randomized, Oblivious, Multi--phase, Minimal routing algorithms. We will show that ROMM routing offers the potential for improved performance compared to both fully randomized algorithms and deterministic oblivious algorithms, under both light and heavy loads. 
ROMM routing also offers close to best--case performance for many common routing tasks. These claims are supported by extensive analysis and simulation of ROMM routing on several different interconnection network architectures, for a set of representative routing tasks. Furthermore, our results show that non--minimality and adaptivity, two common techniques for reducing congestion, are not always required for good routing performance. %X tr-24-95.ps.gz %R TR-25-95 %D 1995 %T Translational Polygon Containment and Minimal Enclosure using Geometric Algorithms and Mathematical Programming %A Victor J. Milenkovic %A Karen M. Daniels %X We present an algorithm for the two-dimensional translational {\em containment} problem: find translations for $k$ polygons (with up to $m$ vertices each) which place them inside a polygonal container (with $n$ vertices) without overlapping. The polygons and container may be nonconvex. The containment algorithm consists of new algorithms for {\em restriction}, {\em evaluation}, and {\em subdivision} of two-dimensional configuration spaces. The restriction and evaluation algorithms both depend heavily on linear programming; hence we call our algorithm an {\em LP containment algorithm}. Our LP containment algorithm is distinguished from previous containment algorithms by the way in which it applies principles of mathematical programming and also by its tight coupling of the evaluation and subdivision algorithms. Our new evaluation algorithm finds a local overlap minimum. Our distance-based subdivision algorithm eliminates a ``false'' (local but not global) overlap minimum and all layouts near that overlap minimum, allowing the algorithm to make progress towards the global overlap minimum with each subdivision. \par In our experiments on data sets from the apparel industry, our LP algorithm can solve containment for up to ten polygons in a few minutes on a desktop workstation.
Its practical running time is better than that of our previous containment algorithms, and we believe it to be superior to all previous translational containment algorithms. Its theoretical running time, however, depends on the number of local minima visited, which is $\bigo((6kmn+k^2m^2)^{2k+1}/k!)$. To obtain a better theoretical running time, we present a modified (combinatorial) version of LP containment with a running time of \[ \bigo\left(\frac{(6kmn+k^2m^2)^{2k}}{(k-5)!} \log kmn \right), \] which is better than any previous combinatorial containment algorithm. For constant $k$, it is within a factor of $\log mn$ of the lower bound. \par We generalize our configuration space containment approach to solve {\em minimal enclosure} problems. We give algorithms to find the minimal enclosing square and the minimal area enclosing rectangle for $k$ translating polygons. Our LP containment algorithm and our minimal enclosure algorithms succeed by combining geometric techniques with linear programming rather than replacing them. This demonstrates the manner in which linear programming can greatly increase the power of geometric algorithms. %X tr-25-95.ps.gz %R TR-26-95 %D 1995 %T Kernel Instrumentation Tools and Techniques %A J. Bradley Chen %A Alan Eustace %X Atom is a powerful platform for the implementation of profiling, debugging and simulation tools. Kernel support in Atom makes it possible to implement similar tools for the Digital UNIX kernel. We describe four non-trivial Atom kernel tools which demonstrate the support provided in Atom for kernel work, as well as the range of application of Atom kernel tools. We go on to discuss some techniques that are generally useful when using Atom with the kernel. Previously, such techniques restricted kernel measurement to the domain of exotic systems research. We hope Atom technology will make kernel instrumentation and measurement practical for a much larger community of researchers. 
%X tr-26-95.ps.gz %R TR-27-95 %D 1995 %T On the Transportation and Distribution of Data Structures in Parallel and Distributed Systems %A Amr F. Fahmy %A Robert A. Wagner %X We present algorithms for the transportation of data in parallel and distributed systems that would enable programmers to transport or distribute a data structure by issuing a function call. Such functionality is needed if programming distributed memory systems is to become commonplace. \par The distribution problem is defined as follows. We assume that $n$ records of a data structure are scattered among $p$ processors, where processor $q_i$ holds $r_{i}$ records, $1 \leq i \leq p$. The problem is to redistribute the records so that each processor holds $\lfloor n/p \rfloor$ records. We solve the problem in the minimum number of parallel data-permutation operations possible for the given initial record distribution. This means that we use $\max(mxr - \lfloor n/p \rfloor, \lfloor n/p \rfloor - mnr)$ parallel data transfer steps, where $mxr = \max(r_{i})$ and $mnr = \min(r_{i})$ for $1 \leq i \leq p$. \par Having solved the distribution problem, it then remains to transport the data structure from the memory of one processor to another. In the case of dynamically allocated data structures, we solve the problem of renaming pointers by creating an intermediate name space. We also present a transportation algorithm that attempts to hide the cost of making a local copy of the data structure, which is necessary since the data structure could be scattered in the memory of the sender. %X tr-27-95.ps.gz %R TR-28-95 %D 1995 %T Learning to take Actions %A Roni Khardon %X We formalize a model for supervised learning of action strategies in dynamic stochastic domains and show that PAC-learning results on Occam algorithms hold in this model as well. We then identify a class of rule-based action strategies for which polynomial time learning is possible. 
The representation of strategies is a generalization of decision lists; strategies include rules with existentially quantified conditions, simple recursive predicates, and small internal state, but are syntactically restricted. We also study the learnability of hierarchically composed strategies, where a subroutine already acquired can be used as a basic action in a higher level strategy. We prove some positive results in this setting, but also show that in some cases the hierarchical learning problem is computationally hard. %X tr-28-95.ps.gz %R TR-29-95 %D 1995 %T Network Related Performance Issues and Techniques for MPPs %A S. Lennart Johnsson %X In this paper we review network related performance issues for current Massively Parallel Processors (MPPs) in the context of some important basic operations in scientific and engineering computation. The communication system is one of the most performance critical architectural components of MPPs. In particular, understanding the demand posed by collective communication is critical in architectural design and system software implementation. We discuss collective communication and some techniques for implementing it on electronic networks. Finally, we give an example of a novel general routing technique that exhibits good scalability, efficiency and simplicity in electronic networks. %X tr-29-95.ps.gz %R TR-30-95 %D 1995 %T Efficient Learning from Faulty Data %A Scott Evan Decatur %X Learning systems are often provided with imperfect or noisy data. Therefore, researchers have formalized various models of learning with noisy data, and have attempted to delineate the boundaries of learnability in these models. In this thesis, we describe a general framework for the construction of efficient learning algorithms in noise-tolerant variants of Valiant's PAC learning model. By applying this framework, we also obtain many new results for specific learning problems in various settings with faulty data.

The central tool used in this thesis is the specification of learning algorithms in Kearns' Statistical Query (SQ) learning model, in which statistics, as opposed to labelled examples, are requested by the learner. These SQ learning algorithms are then converted into PAC algorithms which tolerate various types of faulty data.

We develop this framework in three major parts:

- We design automatic compilations of SQ algorithms into PAC algorithms which tolerate various types of data errors. These results include improvements to Kearns' classification noise compilation, and the first such compilations for malicious errors, attribute noise and new classes of ``hybrid'' noise composed of multiple noise types.
- We prove nearly tight bounds on the required complexity of SQ algorithms. The upper bounds are based on a constructive technique which allows one to achieve this complexity even when it is not initially achieved by a given SQ algorithm.
- We define and employ an improved model of SQ learning which yields noise-tolerant PAC algorithms that are more efficient than those derived from standard SQ algorithms.
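
The SQ-to-PAC conversion can be illustrated with a minimal sketch (not the thesis' actual construction): for a query of the form $\chi(x, l) = 1[l = h(x)]$, classification noise at rate $\eta < 1/2$ shifts the expectation to $\eta + (1 - 2\eta)\mathbf{E}[\chi]$, so an estimate computed from noisy examples can be inverted, assuming $\eta$ is known. The distribution, target concept, and hypothesis below are hypothetical choices for illustration only.

```python
import random

def noisy_examples(n, eta, seed=0):
    # Synthetic distribution: x uniform in [0, 1), target f(x) = 1 iff x >= 0.3.
    # Each label is flipped independently with probability eta (classification noise).
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x >= 0.3 else 0
        if rng.random() < eta:
            label = 1 - label
        data.append((x, label))
    return data

def sq_estimate(chi, sample, eta):
    # Simulate the STAT oracle for a query chi(x, l) of the form 1[l == h(x)]:
    # average chi over the noisy sample, then invert the noise shift
    # E_noisy = eta + (1 - 2*eta) * E_clean.
    e_noisy = sum(chi(x, l) for x, l in sample) / len(sample)
    return (e_noisy - eta) / (1 - 2 * eta)

# Query: how often does the hypothesis h(x) = 1[x >= 0.5] agree with the target?
h = lambda x: 1 if x >= 0.5 else 0
chi = lambda x, l: 1 if l == h(x) else 0

sample = noisy_examples(200_000, eta=0.2)
est = sq_estimate(chi, sample, eta=0.2)
# The true agreement Pr[h(x) = f(x)] is 0.8 (disagreement only on [0.3, 0.5)),
# and the corrected estimate recovers it despite 20% label noise.
```

The sample size plays the role of the query tolerance: estimating a statistic to within tolerance $\tau$ requires on the order of $1/\tau^2$ examples, which is why the complexity bounds on SQ algorithms translate directly into sample-complexity bounds for the derived PAC algorithms.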