Harvard Computer Science Technical Reports for 1998

TR-01-98 [tr-01-98.ps.gz (83 K)]
Joe Marks, Wheeler Ruml, Stuart M. Shieber, and J.Thomas Ngo. 1998. ``A Seed-Growth Heuristic for Graph Bisection.''
We present a new heuristic algorithm for graph bisection, based on an implicit notion of clustering. We describe how the heuristic can be combined with stochastic search procedures and a postprocess application of the Kernighan-Lin algorithm. In a series of time-equated comparisons with large-sample runs of pure Kernighan-Lin, the new algorithm demonstrates significant superiority in terms of the best bisections found.
TR-02-98 [tr-02-98.ps.gz (66 K)]
Michael Karr. 1998. ``Activity Graph.''
This paper discusses activity graphs, the mathematical formalism underlying the Activity Coordination System, a process-centered system for collaborative work. The activity graph model was developed for the purpose of describing, tracking, and guiding distributed, communication-intensive activities. Its graph-based semantics encompasses the more familiar Petri nets, but has several novel properties. For example, it is possible to impose multiple hierarchies on the same graph, so that the hierarchy with which an activity is described does not have to be the one with which it is viewed. The paper also discusses very briefly some aspects of the system implementation.
TR-03-98 [tr-03-98.ps.gz (60 K)]
Michael Karr and Thomas E. Cheatham, Jr.. 1998. ``An Activity Coordination System.''

TR-04-98 [tr-04-98.ps.gz (130 K)]
Leslie G. Valiant. 1998. ``A Neuroidal Architecture for Cognitive Computation.''
An architecture is described for designing systems that acquire and manipulate large amounts of unsystematized, or so-called commonsense, knowledge. Its aim is to exploit to the full those aspects of computational learning that are known to offer powerful solutions in the acquisition and maintenance of robust knowledge bases. The architecture makes explicit the requirements on the basic computational tasks that are to be performed and is designed to make these computationally tractable even for very large databases. The main claims are that (i) the basic learning tasks are tractable and (ii) tractable learning offers viable approaches to a range of issues that have been previously identified as problematic for artificial intelligence systems that are entirely programmed. In particular, attribute efficiency holds a central place in the definition of the learning tasks, as does also the capability to handle relational information efficiently. Among the issues that learning offers to resolve are robustness to inconsistencies, robustness to incomplete information and resolving among alternatives.
Glenn Holloway and Steve Squires. 1998. ``E-L Rational.''
TR-06-98 [tr-06-98.ps.gz (76 K)]
Rebecca Hwa. 1998. ``An Empirical Evaluation of Probabilistic Lexicalized Tree Insertion Grammer.''
We present an empirical study of the applicability of Probabilistic Lexicalized Tree Insertion Grammars (PLTIG), a lexicalized counterpart to Probabilistic Context-Free Grammars (PCFG), to problems in stochastic natural-language processing. Comparing the performance of PLTIGs with non-hierarchical $N$-gram models and PCFGs, we show that PLTIG combines the best aspects of both, with language modeling capability comparable to $N$-grams, and improved parsing performance over its non-lexicalized counterpart. Furthermore, training of PLTIGs displays faster convergence than PCFGs.
TR-07-98 [tr-07-98.ps.gz (537 K)]
Joshua Goodman. 1998. ``Parsing Inside-Out.''
The inside-outside probabilities are typically used for reestimating Probabilistic Context Free Grammars (PCFGs), just as the forward-backward probabilities are typically used for reestimating HMMs. In this thesis, I show several novel uses, including improving parser accuracy by matching parsing algorithms to evaluation criteria; speeding up DOP parsing by 500 times; and 30 times faster PCFG thresholding at a given accuracy level. I also give an elegant, state-of-the-art grammar formalism, which can be used to compute inside-outside probabilities; and a parser description formalism, which makes it easy to derive inside-outside formulas and many others.
TR-08-98 - This report has been withdrawn
, and . 1998. ``.''

TR-09-98 [tr-09-98.ps.gz (272 K)]
Zheng Wang and Norm Rubin. 1998. ``A Statistical Analysis of User-Specific Profiles.''
This technical report examines common assumptions about computer users in profile-based optimization. We study execution profiles of interactive applications on Windows NT to understand how different users use the same program. The profiles were generated by the DIGITAL FX!32 emulator/binary translator system, which automatically runs the x86 version of Windows NT programs on NT/Alpha computers. Our statistical analysis indicates that people use the benchmark programs in different ways. This technical report is a supplement to the paper "Evaluating the Importance of User-Specific Profiling," to appear in "Proceedings of the 2nd USENIX Windows NT Symposium," USENIX Association, August 1998.
TR-10-98 [tr-10-98.ps.gz (260 K)]
Stanley F. Chen and Joshua Goodman. 1998. ``An Empirical Study of Smoothing Techniques for Language Modeling.''
We present a tutorial introduction to $n$-gram models for language modeling and survey the most widely-used smoothing algorithms for such models. We then present an extensive empirical comparison of several of these smoothing techniques, including those described by Jelinek and Mercer(1980), Katz (1987), Bell Cleary and Witten (1990), Ney, Essen, and Kneser (1994), and Kneser and Ney (1995). We investigate how factors such as training data size, training corpus (e.g., Brown versus Wall Street Journal), count cutoffs, and n-gram order (bigram versus trigram) affect the relative performance of these methods, which is measured through the cross-entropy of test data. Our results show that previous comparisons have not been complete enough to fully characterize smoothing algorithm performance. We introduce methodologies for analyzing smoothing algorithm efficacy in detail, and using these techniques we motivate a novel variation of Kneser-Ney smoothing that consistently outperforms all other algorithms evaluated. Finally, results showing that improved language model smoothing leads to improved speech recognition performance are presented.
TR-11-98 [tr-11-98.ps.gz (21 K)]
Adolph Baker and Ellen Baker. 1998. ``Expectation Value of the Lowest of a Set of Randomly Selected Integers.''
Consider the set of positive integers 0, 1, 2, ..., D. If we pick N of them at random, where N < (D+1), what is the expectation (or average value) of the lowest-valued of the N picks? We briefly describe the image database search question that gave rise to this problem, and present a proof that the answer is (D-N+1)/(N+1).
TR-12-98 [tr-12-98.ps.gz (179 K)]
Dan Ellard. 1998. ``CS50 1997 Quantitative Study.''
A discussion and analysis of a quantitative study of CS50, Harvard's introductory Computer Science course. It describes a system for passively monitoring and gathering data about the use of various programming utilities by students, gives an overview of the data gathered with this system, and an analysis of the relationship between student work habits and other factors and success in CS50. Finally, it discusses ideas for how this research could be extended and continued.

This research was performed as part of Dan Ellard's qualifying examination.

TR-13-98 [tr-13-98.ps.gz (23 K)]
Daniel J. Ellard, Penelope A. Ellard, James M. Megquier, J. Bradley Chen, and Margo I. Seltzer. 1998. ``The ANT Architecture -- An Architecture for CS1.''
An overview of the pedagogical and philosophical motivations for teaching machine architecture in CS1, and a description of the ANT architecture, which was specifically designed for use in our introductory computer science course.
TR-14-98 [tr-14-98.ps.gz (105 K)]
Daniel J. Ellard, Penelope A. Ellard, James M. Megquier, and J. Bradley Chen. 1998. ``The ANT Architecture -- An Architecture For CS1.''
A description of the ANT architecture, how we use ANT in our CS1 curriculum, and the pedagogical issues that motivated the creation and design of the ANT architecture.
TR-15-98 [tr-15-98.ps.gz (130 K)]
Daniel J. Ellard and Penelope A. Ellard. 1998. ``Data Representation and Assembly Language Programming The ANT-97 Architecture .''
A tutorial for data representation (principally two's complement notation and arithmetic and ASCII codes) and assembly language programming using the ANT architecture. This tutorial has been used in Harvard's CS1 course (1996-1998).
TR-16-98 [tr-16-98.ps.gz (6386 K)]
Ellen Jill Baker. 1998. ``The Mug-Shot Search Problem A Study of the Eigenface Metric, Search Strategies, and Interfaces in a System for Searching Facial Image Data.''
This thesis presents an investigation of methods for conducting an efficient look-up in a pictorial "phonebook" (i.e., a facial image database). Although research on efficient "mug-shot search" is under way, little has yet been done to evaluate the effectiveness of various proposed techniques, and much work remains before systems as practical or ubiquitous as phonebooks are attainable. The thesis describes a prototype system based on the idea of combining a composite face creation method with a face-recognition technique, so that a user may create a facial image and then automatically locate other similar-looking faces in the database. Several methods for evaluating such a system are presented as well as the results and analysis of a user-study employing the methods.

Three basic system components are considered and evaluated: the metric for determining which faces are most similar in appearance to a given "query" face, the interface for producing the query face, and the search strategy. The data demonstrate that the Eigenface metric is a useful (though imperfect) model of human perception of similarity between faces. The data also show how the lack of agreement among people about which faces are most similar to a query limits what can be reasonably expected from any metric. Via simulation, it is demonstrated that, if indeed there were a single human metric for assessing facial similarity, and if the Eigenface metric correlated perfectly with this human metric, then simple interactive hill-climbing in the space of the database images would be an excellent search strategy, capable of reducing the average number of image inspections required in a search to about 2% of the database. But this superiority of hill-climbing in principle is not sustained in practice, given the observed level of correlation between the Eigenface similarity metric and the "human" one. The average number of image inspections required for the hill-climbing strategy was, in fact, closer to 35% of the database. While this represents an improvement over the 50% required on average for a simple sequential search of the data, it is still insufficient for practical use. However, given the actual performance of the Eigenface metric, the study data show that a non-iterative strategy of constructing a single query image that is a composite of selected features from 100 random database faces is a better approach, reducing the average number of image inspections to about 20% of the database. These and other examples demonstrate and quantify the benefits of an interface in which the Eigenface metric is combined with a composite creation system.