Orthogonal transforms are ubiquitous in engineering and applied
science. Such applications place stringent demands on a computer
system: in time, which translates into processing power; in problem
size, which translates into memory capacity; or, most often, both. To
meet these demands, such applications have been migrating to parallel
computing platforms, which offer significantly more processing power
and memory. This trend calls for fast parallel algorithms for
orthogonal transforms that remain portable across many parallel
platforms; developing such algorithms is the subject of this thesis.
We present a unified approach to deriving fast parallel orthogonal
transforms: for each algorithm we establish a theoretical formulation,
give a simple specification that facilitates implementation and
analysis, and evaluate performance via a predefined set of arithmetic,
memory, communication and load--imbalance complexity metrics. We adopt
a bottom--up approach, constructing the thesis as progressively larger
modular blocks, each relying on a lower--level block for its
formulation, algorithmic specification and performance analysis.
Since communication is typically the bottleneck in parallel computing,
we first establish an algebraic framework for stable parallel
permutations, the predominant communication requirement of parallel
orthogonal transforms. We present a taxonomy that categorizes stable
permutations into the classes of index--digit, linear, translation,
affine and polynomial permutations. For each class, we demonstrate its
general behavioral properties, and we prove permutation locality and
other properties for particular examples.
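As a concrete illustration (not drawn from the thesis itself), the
bit-reversal permutation is the classic index--digit permutation: each
element moves to the index obtained by reversing the binary digits of
its original index. A minimal sketch, assuming a power-of-two length:

```python
def bit_reverse_permute(x):
    """Permute a length-2^n sequence by reversing the bits of each index.

    This is an index-digit permutation: position i (with binary digits
    b_{n-1}...b_0) is sent to position b_0...b_{n-1}.
    """
    n = len(x)
    bits = n.bit_length() - 1
    assert 1 << bits == n, "length must be a power of two"
    out = [None] * n
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2)  # reverse the bit string
        out[rev] = x[i]
    return out

print(bit_reverse_permute(list(range(8))))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Note that bit reversal is an involution (applying it twice restores the
original order), one of the behavioral properties such a taxonomy can
capture per class.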
These results are then applied in formulating, specifying and
evaluating the performance of the direct and load--balanced parallel
algorithms for the fast Fourier transform (FFT); we prove the latter
to be optimal. We demonstrate the versatility of our complexity
metrics by substituting them into the PRAM, LPRAM, BSP and LogP
computational models. We then employ the load--balanced FFT to derive
a novel polynomial--based discrete cosine transform (DCT) and
demonstrate its performance advantage over the classical
DCT. Finally, the polynomial DCT serves as a building block for a
novel parallel fast Legendre transform (FLT), a parallelization of the
Driscoll--Healey O(N log^2 N) method. We show that the algorithm
is hierarchically load--balanced by composing its specification and
complexity from its modular blocks.
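For orientation (a sequential sketch, not the thesis's parallel
algorithms), the radix-2 Cooley--Tukey FFT exhibits the recursive
block structure on which such parallel formulations build: a size-N
transform splits into two size-N/2 subtransforms combined by butterfly
operations.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT for power-of-two lengths."""
    n = len(x)
    if n == 1:
        return list(x)
    assert n % 2 == 0, "length must be a power of two"
    even = fft(x[0::2])   # subtransform of even-indexed samples
    odd = fft(x[1::2])    # subtransform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # twiddle factor e^{-2*pi*i*k/n} combines the two halves
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

The even/odd split is itself an index--digit permutation of the input,
which is why the permutation framework above governs the communication
pattern of the parallel FFT.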