Who Should Think Parallel?

Chris Wadsworth

June 1989

Software for Parallel Computers, Unicom

Abstract

An evolving taxonomy of styles of parallelism in problems, and common paradigms for solutions, is presented. It is argued that this leads to a better separation between issues which are the proper concern of the user and those which are provided by the system (whether by hardware or by software). A virtual architecture for MIMD machines is outlined, based on predictions of future hardware features.

Contents

1 Introduction
2 Why Parallel Processing
3 Problem Parallelism
4 Solution Paradigms
5 Virtual Architecture
6 A Future Red Herring
7 Concluding Remarks
8 References

1 Introduction

In fact the advent of parallel programming may do something to revive the pioneering spirit in programming, which seems to be degenerating into a rather dull and routine occupation.

S. Gill, 1958 [3]

Plus ça change, plus c'est la même chose! Hardware of course has always been parallel, e.g. overlapped I/O, parallel adders, and gate logic, but the user has not needed to be aware of this. The hardware, or in some cases the systems programmer, does all the work for him.

From the mid-1950s to the late-1970s a number of experimental parallel machines were designed, but none were a commercial success. Progress in parallel programming was relatively limited during this period (though of course there was much interesting work done on the theoretical basis for concurrency and on the discovery of parallel algorithms for particular problems).

The 1980s has seen an explosion of commercially available parallel machines with a wide variety of architectures, e.g. array processors, hypercubes, transputer systems, and shared-memory multiprocessors. Almost overnight the VLSI revolution in particular has made parallel processing affordable; indeed the systems mentioned now typically offer improvements in price/performance of an order of magnitude or better compared to conventional uniprocessors.

Advances in parallel programming have not been so sudden (or dramatic)! Programming the new parallel machines was certainly initially difficult. Early examples were perhaps too tightly coded for a particular machine. Gradually a body of experience, knowledge, approaches, and techniques has been built up, both of applications apparently best suited to one style of architecture (see, e.g., Fox [1] [2]) and of methods and parallel thinking of a more generic nature.

This paper gives an overview of some of the unifying ideas that have emerged, with some remarks on probable future trends in parallel programming and on the implications for parallel languages and (virtual) architectures. Some related predictions are presented by Jesshope [9]. We seek to stress generic ideas, common paradigms, and abstraction. Our concern in fact is with a separation of concerns: distinguishing the rightful concerns of the user from those of the system (whether provided by hardware or by software). From the user's point of view provision of the right abstractions is regarded as somebody else's problem [10], though no less important or interesting for that.

2 Why Parallel Processing

Let us be in no doubt at the outset: parallel processing is a means to an end. The end-user is typically motivated by one or more of the following main potential benefits:

  1. cheaper;
  2. faster;
  3. bigger (problems);
  4. more (functionality).

In short, better performance in one direction or another. At this level the motivation for parallel processing is little different from that for any other advance in computing technology (or indeed from that for computerising some task in the first place). One difference is that parallel solutions are often intrinsically simpler in structure, e.g. when modelling a system which is inherently parallel (see 3.1 below).

The bottom line is economics. The economics of mass-produced VLSI components increasingly favours parallel systems built from many cheap processors. Current systems already achieve a price/performance advantage of 10 to 30 times over traditional minicomputers, mainframes, and supercomputers; e.g. a parallel CFD program ported by Rolls-Royce/TopExpress gives 20% of CRAY performance on a transputer system costing 100 times less [6].

This last comparison, needless to say, is based on hardware (purchase) costs alone. The costs of developing parallel programs, in both time and effort, are as yet not well understood; indeed there appear to be few studies which address the issues at all, though Tucker [16] points to some of the commercial issues. A related difficulty is that of balancing extra development effort against run-time gains. One thing has become clear, however, from the experience gained in recent years: there is now a growing range of applications and styles of parallelism for which parallel programming is no longer so terribly difficult.

3 Problem Parallelism

Problems vary widely in the degree to which parallelism is evident in the problem itself.

3.1 Obvious parallelism

Sometimes parallelism does not need to be teased out of a problem at all; it is staring the user in the face. Problems involving large, regular data spaces are often trivially solved by partitioning the data into pieces that can be processed concurrently. Ray tracing [13] is a typical example. In many cases the degree of parallelism evident in the data far exceeds the number of processors that are currently affordable. Such problems have been referred to as embarrassingly parallel [1].

Other examples in this category are found in real-world systems consisting of many interacting agents, e.g. a control system for an industrial plant with multiple sensors and controllers. The real-world system is inherently parallel; it is far simpler and more natural to follow this parallelism than to squeeze a solution onto a single processor. It may well be the best way to ensure that any real-time demands are met. Scalable performance is often of lesser importance in this area.

The user concerns in this category are predominantly threefold:

  1. to have a convenient means of expressing the obvious parallelism;
  2. to have the work spread evenly across the available processors;
  3. to have any real-time constraints satisfied.

3.2 Discovered parallelism

More commonly a problem has no obvious parallelism, or the obvious parallel algorithm turns out to be a poor solution. The human, however, knows or is able to devise a good strategy by discovering parallelism in an approach to a solution.

There are many such examples in the literature, e.g. in linear algebra, sorting algorithms, and graph problems. Another example is the travelling salesman problem; here the obvious parallel search of all permutations is infeasible, but a technique based on simulated annealing [2, pg.215] yields a good solution (in this case an approximate one).

The key point about this category is that human ingenuity will always discover clever parallel algorithms that are unlikely to be found by automatic means (at least for the foreseeable future!). The user concerns, once a parallel strategy has been devised, are much the same as for the previous category, except that the user may need to be more conscious of balancing the work load in his design. With current understanding, moreover, the algorithm may well need to be chosen with the specific target architecture in mind; the best parallel algorithm for, say, matrix multiplication on a DAP may well be unsuitable for a transputer system, or vice versa.

3.3 Implicit parallelism

A third approach is to ignore parallelism! The user programs in a language without explicit parallel features and the system compiles it into parallel code as best it knows how. In principle, this approach has clear portability advantages.

Foremost in this category is the use of declarative languages, such as functional and logic languages. Their strength in this context derives from the fact that these languages remain uncommitted about matters of order of evaluation, leaving the system free to make choices appropriate to an implementation (whether parallel or not). As yet, however, parallel implementations of declarative languages are not competitive, though there is much active research, e.g. see Peyton-Jones [14] for a recent progress report. A major focus in this research is the design of novel target architectures, e.g. dataflow and graph reduction, and possibly special hardware.

At the opposite extreme are the conventional sequential languages such as Fortran, C, and Cobol. A large body of (large) programs written in these languages has been built up over the last 30 years or so, the so-called dusty-deck problem. Such programs may be characterised as being embarrassingly serial. It is impractical in the short term to contemplate any wholesale rewriting into a parallel language. More attractive is the possibility of compilers which detect and extract parallelism automatically. There are now quite good compilers for shared-memory machines (the Alliant Fortran compiler, for instance, is based on a range of advanced techniques for data-dependence analysis), but for distributed-memory machines automatic parallelisation is still a research topic. Other papers in this conference discuss the progress in software migration aids which can assist the task of porting existing sequential software by hand without extensive rewriting.

3.4 Discussion

The first two categories, collectively covered by the term explicit parallelism, are clearly distinct from the third, and are at a more advanced stage of development and exploitation. Implicit parallelism promises substantial benefits for the future, but for the time being explicit parallelism holds sway for real applications work on grounds of efficient implementation, and to some extent on grounds of familiarity.

The situation with efficiency here is not unlike that which pertained when the first high-level language compilers were appearing. Many felt that a compiler would never produce as good code as hand-written assembler. We all know which way that argument went: the advantages of high-level languages soon came to far outweigh any loss of (run-time) efficiency, particularly as the compilers got better; only in special circumstances does anyone program in assembler these days.

It is too soon to predict whether the same will happen to explicitly parallel programming. Perhaps it never will completely; many will feel that parallelism is too far-reaching to be automated. But need the user be concerned with so much of the fine detail?

Current languages for expressing parallelism explicitly may be broadly classified into two families:

  1. (new) process languages, e.g. Occam, Ada;
  2. parallel extensions to sequential languages, e.g. several variants of parallel C, Fortran, Pascal, and Lisp.

Both offer a pragmatic way forward in the short term, the first perhaps more general than the second, but the current languages do have a number of drawbacks. Occam [7], for instance, though it provides excellent support for a theoretically-proven model of concurrency, is rather spartan in its normal programming features (no data structures, no recursion, no data abstraction, no type parameterisation, no modules, etc). In the second family, the examples of parallel extensions we have seen seem to us methodologically unsound, both as exercises in language design and in making the task of producing safe, reliable software harder rather than easier. There is also the problem of incompatible, non-standard variants among extensions of the same language.

Occam's experience has an important lesson for the promotion of new languages. There has been strong resistance to Occam from the applications user community just on the basis that it is a new language. Occam is also seen as being too closely associated with one architecture, the transputer. New languages must, in our view, be integrated as part of a coherent development methodology with a full environment of necessary support tools from the start. (Alternatively, they must have the backing of a big funding agency!)

4 Solution Paradigms

There are by now a number of common paradigms for parallel implementations of many applications. The classification which follows is drawn from Hey [5] with two additions.

4.1 Farm parallelism

This paradigm applies to problems whose data space is partitioned into many pieces, which we shall call packets, that can be processed independently. A master, or farmer, process farms out packets to each of several identical worker processes. The workers return the results to the master, which then sends a further packet. By partitioning the data into rather more packets than there are workers, the master is able to keep all the workers busy even if some packets take much longer to compute than others. In other words, this solution load balances automatically.

In this case there is no need for communication between the worker processes. The only communication is that between the master and the workers for distributing the work and returning the results. Depending on the problem, these distribution costs are negligible provided the packet size can be chosen large enough to amortise the distribution time.
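
As an illustration only, the farm paradigm can be sketched in a modern channel-based notation. The sketch below uses Go (not a language available in 1989); the integer packets and the squaring work function are invented purely for the example.

  // farm.go: a minimal sketch of a processor farm.
  // The integer packets and the squaring work function are illustrative only.
  package main

  import (
    "fmt"
    "sync"
  )

  // work is the problem-specific computation slotted into the farm.
  func work(packet int) int { return packet * packet }

  func main() {
    const numWorkers = 4
    const numPackets = 20 // rather more packets than workers, so the farm load balances

    packets := make(chan int)
    results := make(chan int)

    var wg sync.WaitGroup
    for w := 0; w < numWorkers; w++ {
      wg.Add(1)
      go func() { // one worker process per goroutine
        defer wg.Done()
        for p := range packets { // take a packet, return a result, repeat
          results <- work(p)
        }
      }()
    }

    go func() { // the farmer: hand out the packets, then signal that there are no more
      for i := 0; i < numPackets; i++ {
        packets <- i
      }
      close(packets)
    }()

    go func() { wg.Wait(); close(results) }() // close results once all workers finish

    sum := 0
    for r := range results { // the farmer's other half: collect results as they arrive
      sum += r
    }
    fmt.Println("sum of results:", sum)
  }

With more packets than workers, a worker that finishes early simply takes the next packet from the channel, which is exactly the automatic load balancing described above.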

Ray tracing [13] is the best-known example. Another is high-energy physics programs that run many simulations on multiple event sets [4]. There are many examples also in commercial applications, e.g. transaction processing, and payroll.

4.2 Geometric parallelism

In this case, as the name suggests, there is a geometric structure to the data space, e.g. volumes in a fluid dynamics problem. Unlike farm parallelism, however, the computations for each region are not independent and may now depend on those of their immediate neighbours (only).

This case is naturally modelled by assigning each region to a different processor (statically). Typically the same code is run on each processor, though not necessarily. Provided the compute times (between communications) for each region are roughly balanced, and the frequency of communication is not too high, the solution works well.

Again there is a choice about the size (and shape) of the regions. It is not necessary either that the processor topology matches the dimensionality of the problem, e.g. a 3D-problem can be solved on a 2D-array by projecting one dimension onto each processor, or a 2D-problem can be projected onto a linear chain of processors. Extensions to irregular meshes, e.g. in finite element problems, are also possible.

A toy example is the game of Life, in which successive states of points in a grid are determined as a combination of preceding states of their neighbours. The commonest real applications arise in numerical integration of partial differential equations, e.g. for problems in electromagnetics, fluid dynamics, and weather forecasting.
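
Purely for illustration, a geometric decomposition can be sketched as follows (again in Go; the one-dimensional grid, the three-point smoothing rule, and the sizes are all invented for the example). Each concurrent region owns a strip of cells and exchanges only its two boundary values with its immediate neighbours on every timestep.

  // geometric.go: 1-D geometric decomposition with nearest-neighbour halo exchange.
  // The smoothing rule and the problem sizes are illustrative only.
  package main

  import (
    "fmt"
    "sync"
  )

  const (
    workers = 4  // one concurrent "processor" per region
    cells   = 8  // cells owned by each region
    steps   = 10 // number of timesteps
  )

  // region simulates one processor: it owns cells+2 values (two of them halo cells)
  // and talks only to its immediate neighbours via the channel ends it is given.
  func region(w int, sendL, sendR chan<- float64, recvL, recvR <-chan float64, wg *sync.WaitGroup) {
    defer wg.Done()
    u := make([]float64, cells+2)
    u[cells/2] = float64(w) // arbitrary initial data

    for s := 0; s < steps; s++ {
      // Exchange boundary values with immediate neighbours only.
      if sendL != nil {
        sendL <- u[1]
      }
      if sendR != nil {
        sendR <- u[cells]
      }
      if recvL != nil {
        u[0] = <-recvL
      }
      if recvR != nil {
        u[cells+1] = <-recvR
      }
      // Purely local update: a simple three-point smoothing rule.
      next := make([]float64, cells+2)
      for i := 1; i <= cells; i++ {
        next[i] = (u[i-1] + u[i] + u[i+1]) / 3
      }
      copy(u, next)
    }
    fmt.Printf("region %d finished, u[1]=%.3f\n", w, u[1])
  }

  func main() {
    // One buffered channel per direction between each pair of neighbours, so both
    // sides can post their boundary value before either side reads.
    toRight := make([]chan float64, workers) // toRight[i]: region i -> region i+1
    toLeft := make([]chan float64, workers)  // toLeft[i]:  region i -> region i-1
    for i := range toRight {
      toRight[i] = make(chan float64, 1)
      toLeft[i] = make(chan float64, 1)
    }

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
      wg.Add(1)
      var sendL, sendR chan<- float64
      var recvL, recvR <-chan float64
      if w > 0 {
        sendL, recvL = toLeft[w], toRight[w-1]
      }
      if w < workers-1 {
        sendR, recvR = toRight[w], toLeft[w+1]
      }
      go region(w, sendL, sendR, recvL, recvR, &wg)
    }
    wg.Wait()
  }

Only the two halo values cross region boundaries each step; all other work is local, which is why the paradigm scales well when communication is infrequent relative to computation.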

4.3 Long-range parallelism

Life gets more difficult when the rules are changed! Suppose the propagation rule is made more complicated, with next states depending on the previous states of all points in the system. What we have then is essentially the n-body problem in astrophysics (motion under gravity), or multi-particle dynamics in molecular modelling.

Effective solutions have been developed based on a ring of processors [11]. Much as before, the particles are divided (equally) amongst the processors. On each timestep, each processor starts a packet around the ring, carrying information about its own particles and accumulating the effect of the forces due to others. On return to the home processor the states are updated. A simple optimisation is easily added: by exploiting symmetry of forces (and packets), packets need only travel half-way round the ring, then take a short-cut home.
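
The ring scheme can be sketched as follows, purely as an illustration (in Go; summing the other processors' local values stands in for real force accumulation, and the half-way short-cut optimisation is omitted for simplicity).

  // ring.go: long-range interactions via packets circulating a ring of processors.
  // Summing the other processors' local values stands in for real force accumulation.
  package main

  import (
    "fmt"
    "sync"
  )

  const workers = 4

  // A packet carries the identity of its home processor and an accumulator for
  // the contributions gathered from every other processor it visits.
  type packet struct {
    home int
    acc  float64
  }

  func main() {
    // in[i] is the input channel of processor i; i forwards to in[(i+1)%workers].
    in := make([]chan packet, workers)
    for i := range in {
      in[i] = make(chan packet, workers) // buffered so the initial sends never block
    }

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
      wg.Add(1)
      go func(w int) {
        defer wg.Done()
        local := float64(w + 1) // stand-in for this processor's own particle data

        in[(w+1)%workers] <- packet{home: w} // start my packet around the ring

        for received := 0; received < workers; received++ {
          p := <-in[w]
          if p.home == w { // my packet has come home: use the accumulated result
            fmt.Printf("processor %d: contribution from others = %.1f\n", w, p.acc)
            continue
          }
          p.acc += local // add this processor's contribution, then pass the packet on
          in[(w+1)%workers] <- p
        }
      }(w)
    }
    wg.Wait()
  }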

4.4 Algorithmic parallelism

Having exhausted the easy cases of data parallelism, we are left with partitioning the program. This is algorithmic parallelism, which is much less regular in structure.

One well-known example is processes connected in a pipeline with the output from each passing as input to the next. This generalises, through splitting and merging, to a process net, but no other common subcases have been identified. Systolic arrays are essentially a special formulation of process nets, as are other novel approaches such as dataflow and demandflow.
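
A pipeline, the simplest such process net, can be sketched as below (in Go, for illustration only; the generate and square stages are invented and trivial). Each stage is a separate process, and items flow through the connecting channels as soon as they are ready.

  // pipeline.go: algorithmic parallelism as a chain of processes.
  // The generate and square stages are trivial and illustrative only.
  package main

  import "fmt"

  // generate is the first stage: it produces the input stream 1..n.
  func generate(n int) <-chan int {
    out := make(chan int)
    go func() {
      for i := 1; i <= n; i++ {
        out <- i
      }
      close(out)
    }()
    return out
  }

  // square is a middle stage: it transforms each item as it arrives.
  func square(in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
      for x := range in {
        out <- x * x
      }
      close(out)
    }()
    return out
  }

  func main() {
    // The stages run concurrently; each item flows on as soon as it is ready.
    for y := range square(generate(5)) {
      fmt.Println(y)
    }
  }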

4.5 Dynamic parallelism

Each of the above involves only static parallelism, i.e. the partition into processes is fixed at design-time (some would say compile-time). There are many problems for which this is not possible, or at least not easily so. In dynamic parallelism new processes are generated at run-time in a data-dependent fashion.

There are few generally useful paradigms in this case. One that has been studied is divide-and-conquer (-and-combine) [12]. Each subject problem spawns zero or more subproblems, each of which is solved recursively and the results combined to solve the subject problem. The number of subproblems, and hence of sub-processes in the natural parallel implementation, depends on the data values in the subject problem.

One solution in this case is to simulate the dynamic parallelism by using a paradigm-specific kernel on each processor. The kernel used in [12] included experiments with process-stealing rules whereby idle processors steal work from their neighbours.
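
As an illustration of the divide-and-conquer paradigm (not of the ZAPP kernel of [12]), the sketch below, in Go, spawns new processes at run-time as subproblems arise; parallel summation of an array stands in for a real recursive problem, and the sequential cutoff is arbitrary.

  // divide.go: dynamic parallelism by divide-and-conquer(-and-combine).
  // Parallel summation of a slice stands in for a real recursive problem;
  // the cutoff below which work stays sequential is arbitrary.
  package main

  import (
    "fmt"
    "sync"
  )

  const cutoff = 4 // below this size, solve sequentially rather than spawn

  func parSum(xs []int) int {
    if len(xs) <= cutoff {
      s := 0
      for _, x := range xs {
        s += x
      }
      return s
    }
    // Divide: spawn a new process for one half at run-time, solve the other
    // half in the current process, then combine the two partial results.
    mid := len(xs) / 2
    var left int
    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
      defer wg.Done()
      left = parSum(xs[:mid])
    }()
    right := parSum(xs[mid:])
    wg.Wait()
    return left + right
  }

  func main() {
    xs := make([]int, 100)
    for i := range xs {
      xs[i] = i + 1
    }
    fmt.Println(parSum(xs)) // prints 5050
  }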

4.6 Discussion

The classification is useful but incomplete. Static data parallelism is well covered, though there are no doubt other useful sub-paradigms waiting to be discovered. It is also well catered for in practice, with a range of standard harnesses available - program templates which set up the overall process structure and do all the communications, into which the problem-specific code (which may be purely sequential) is slotted. The user either uses a standard harness as is or may extend it for additional facilities, e.g. monitoring.

Dynamic parallelism and algorithmic parallelism need further refinement into useful sub-paradigms.

5 Virtual Architecture

There are too many parallel architectures! The best way forward is to define an abstract programming model based on a virtual architecture, with systems software implementing the abstractions as appropriate across a variety of parallel hardware choices.

It would be unrealistic to suggest that one single virtual architecture will suffice; how, for example, would novel approaches such as dataflow or neural nets fit in? We shall limit our attention to the dominant examples of MIMD machines, and their expected successors, for which there are already indications of convergence to a common programming model.

Our discussion is oriented to distributed memory machines, but we see no reason why the model should not be applied to shared memory. Indeed it may actually be easier to implement on a shared-memory machine. The challenge for shared-memory architectures is in a different direction: scalability to large numbers of processors.

We shall look at four areas - communications, processing, libraries, and memory - roughly in the order in which we expect abstract models with good solutions to emerge.

5.1 Virtual communications

Most MIMD machines use point-to-point communication links and are well suited to algorithms involving only local communications. General communication between distant processors is far less efficient, because of the software overheads in each intermediate processor. It is also cumbersome for the programmer to be concerned with routing messages that are merely passing through, as well as with his own communications. Software harnesses have been developed for doing this but remain poor in efficiency.

The desirable abstract model is obvious: that of a logically totally-connected network in which the user need not be concerned about routing. The challenge is to devise routing methods for which distant communications are as efficient as local.

Recently Intel [8] launched their second-generation hypercube, the iPSC-2, which includes special hardware for direct connect through-routing of messages. There are also strong indications from Inmos that the next-generation transputer will provide some form of through-routing. Convergence to a common model is thus well underway in this area.

What is less clear is the entity addressed by a message: a processor, a process, or a (logical) channel between processes? Another choice is the semantics of message passing: synchronised or not? guaranteed delivery or not? buffering? It is up to the users to pull the providers in the preferred direction.
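
The buffering and synchronisation choice can be made concrete with a small sketch (in Go, a modern stand-in rather than any of the systems discussed here): an unbuffered channel gives synchronised, rendezvous-style message passing, while a buffered channel lets the sender run ahead of the receiver.

  // semantics.go: the synchronisation/buffering choice, shown with Go channels.
  package main

  import "fmt"

  func main() {
    unbuffered := make(chan int)  // synchronised: a send waits for a matching receive
    buffered := make(chan int, 8) // buffered: a send completes as soon as there is room

    go func() {
      unbuffered <- 1 // blocks this sender until the receiver below is ready
    }()
    fmt.Println(<-unbuffered)

    buffered <- 2 // does not block: the message is queued for later delivery
    fmt.Println(<-buffered)
  }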

5.2 Virtual processing

Placement of (logical) processes onto (physical) processors, and specification of interconnect topology (when configurable), is currently done manually by the user, either at the outermost program level as in Occam, or outside the programming language at the operating system level in a configuration file. Both methods work, but are notoriously tricky and tedious. Placement of an Occam channel, for instance, at a particular link number is very low level, analogous to allocating a variable at a particular memory location.

The user does not normally care how his processes are allocated to processors when these are drawn from a pool of identical processors. There are exceptions, however. Sometimes several processes must be placed on the same processor, e.g. de-coupled communications and compute processes; or it may be that not all processors are identical, e.g. a process for controlling a particular device can only be run on the processor to which that device is attached (graphics, input sensors, disks, etc). Current mechanisms may be seen as overkill: because it is necessary for some placements to be specified by the user, the method requires the user to specify them all.

What is needed is a means for the user to express only the essential placement constraints. Beyond that it should be the responsibility of the system. A good solution for virtual communications should make automatic placement much more straightforward.
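
Purely as a sketch of the idea, and not of any existing configuration language, the essential placement constraints might be captured as data like the following (in Go; the constraint kinds, process names, and processor name are all invented), with every process not mentioned left to the system to place automatically.

  // placement.go: a hypothetical notation for the essential placement constraints only.
  // The constraint kinds, process names, and processor name are all invented.
  package main

  import "fmt"

  type constraint struct {
    kind      string   // "colocate" or "pin"
    processes []string // the logical processes the constraint applies to
    processor string   // the required physical processor, for "pin" only
  }

  func main() {
    constraints := []constraint{
      // De-coupled communications and compute processes must share a processor.
      {kind: "colocate", processes: []string{"comms", "compute"}},
      // The display process can only run where the graphics device is attached.
      {kind: "pin", processes: []string{"display"}, processor: "node-with-graphics"},
    }
    // Every process not mentioned here would be placed automatically by the system.
    for _, c := range constraints {
      fmt.Printf("%+v\n", c)
    }
  }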

The above covers static allocation only. For dynamic parallelism, allocation at run-time becomes essential. Process migration also needs to be considered, to balance the load between processors. Some studies of dynamic allocation and migration strategies have been conducted, but there is no definitive solution yet. One example is Equus [15].

5.3 Virtual libraries

Virtual is something of a misnomer in this case. It is standard practice on uniprocessors to have standard libraries in which the individual routines are implemented differently on different machines.

It is trivial to extend the practice to parallel machines, except for one complication: reconfigurability (including scalability). On a reconfigurable machine, different routines may run best on different configurations (ring, grid, tree, or whatever). When several library routines are used in the same program, either the routines need to be parameterised for the configuration chosen for the program (if static configuration) or reconfiguration must be done at run-time and not be too slow.

What is the library implementer, who cannot know the programs in which the library routines will be used, to do? A better answer should follow from solutions in the previous two areas.

5.4 Virtual memory

This again is standard on uniprocessors and is generally provided in conjunction with multitasking and/or multi-user timesharing. In that context virtual memory fulfills two distinct functions:

  1. memory protection: protecting the address space of each user (or task) against those of others (including the system kernel);
  2. large address space: enabling a user program to have an address space (much) larger than the available physical memory.

(The second alone is often taken as virtual memory, but provision of both is almost always via the same mechanism of a memory management unit.)

On a parallel machine with shared memory it is straightforward to extend the implementation from that on a uniprocessor; the only real choice, perhaps, is whether the memory management unit is centralised close to the memory (and the swap space on disks) or distributed closer to the processors.

With distributed memory, the need for memory protection is best avoided by assigning users to different (sets of) processors. Why share a processor between users when there are lots (of processors)? Support for large address spaces is much more difficult to provide. Few processors will be directly connected to disks; that would destroy the economics of massive parallelism. So where would one page to, and how? Paging across the communication network is possible in principle, but is likely to be too slow in practice. This seems to us a particularly difficult nut for the system designers to crack.

Perhaps there is another way of thinking about applications whose memory requirements exceed the total physical memory of all the processors. Certainly there will always be users with such demands! For the time being, they may have to revert to defining their own overlays.

5.5 Discussion

The virtual architecture we have sketched is obvious enough; the challenge is implementing the functionality without sacrificing performance. To some extent what we have given is a wish list. We (are led to) believe that effective solutions are likely in the near future.

At least the first and last areas will clearly need hardware support. The second is to do with user languages (placement constraints) and system software tools (automatic placement). Libraries may become trivial when the others are solved.

6 A Future Red Herring

A little controversy to close .... The literature on parallel processing is almost obsessed with performance in a narrow sense, that of efficiency of processor utilisation. This is understandable for current parallel systems which, though relatively cheap, are costly enough. Efficiency matters when resources are scarce or expensive. Few of us have that many processors! Even now there are not that many systems with 100 processors, let alone 1000 or more.

If the industry hype is to be believed, processing will soon become a cheap, plug-in resource just like memory. Predictions have been made that office workstations with perhaps 1000 processors, and central installations perhaps 100 times larger, will be commonplace before the end of the century. Processor utilisation then becomes a secondary issue, of concern to the system provider perhaps but little to the user. After all, who really worries about memory utilisation any more?

Of course the user cannot afford to be grossly inefficient. He must choose a good parallel algorithm, just as for a sequential sorting algorithm he would choose, say, an O(n log n) algorithm rather than an O(n²) algorithm. But that is another story ....

7 Concluding Remarks

We have argued that by examining styles of parallelism, common paradigms for parallel solutions, the parallel programming model, and the capability of current and predicted parallel hardware, we can arrive at a better separation between matters which are the proper concern of the user and functionality desired from the system. A virtual architecture for future MIMD machines has been outlined.

There are many further issues - granularity, scalability, synchronisation, deadlock, correctness, for instance - each of which gives rise to similar questions about the relative roles of the user and the system. When we get the answers right, parallel processing will have come of age. It would be very surprising, however, if there were a unique answer.

To Gill's chagrin perhaps, parallel programming is now emerging from its pioneering stage, though the prospect ahead is certainly not dull!

8 References

[1] G.C. Fox, "What Have We Learnt from Using Real Parallel Machines to Solve Real Problems", 3rd Conf on Hypercube Concurrent Computers and Applications, JPL Pasadena, California, January 1988.

[2] G.C. Fox et al., Solving Problems on Concurrent Processors, Prentice-Hall, Englewood Cliffs, N.J., 1988.

[3] S. Gill, "Parallel Programming", Computer Journal, Vol. 1, 1958.

[4] I. Glendinning & A.J.G. Hey, "Transputer Arrays as Fortran Farms for Particle Physics", Report SHEP 86/87-7, University of Southampton, 1987.

[5] A.J.G. Hey, "Parallel Decomposition of Large Scale Simulations in Science and Engineering", Report SHEP 86/87-2, University of Southampton, 1987.

[6] A. Holman, "At the Leading Edge", Parallelogram 4, June 1988.

[7] Inmos Ltd, occam2 Reference Manual, Prentice-Hall, Hemel Hempstead, 1987.

[8] Intel Scientific Computers, "iPSC-2", Intel International Ltd, Swindon, 1988.

[9] C.R. Jesshope, "Parallel Processing, the Transputer and the Future", Microprocessors and Microsystems, Vol.13, No.1, pp.33-37, January/February 1989.

[10] C.R. Jesshope & G. Panesar, "P1085: Somebody Else's Problem", Proc CONPAR 88, Cambridge University Press, Cambridge, UK, 1988.

[11] Meiko Ltd, "Achieving Results", Bristol, 1987.

[12] D.L. McBurney & M.R. Sleep, "Transputer-based Experiments with the ZAPP Architecture", Proc PARLE Conference (Eindhoven), Springer-Verlag LNCS 258, 1987.

[13] J. Packer, "Exploiting Concurrency: A Ray Tracing Example", Inmos Technical Note 7, Inmos Ltd, Bristol, 1987.

[14] S.L. Peyton-Jones, "Parallel Implementations of Functional Programming Languages", Computer Journal, Vol.32, pp.175-186, April 1989.

[15] A.V. Sahiner, A. Sherman & T. Kindberg, "Horsebox" (article on Equus), Parallelogram 14, May 1989.

[16] N. Tucker, "Commercial Issues: Parallel Processing and the Transputer", Microprocessors and Microsystems, Vol.13, Special Issue: Applying the Transputer, March 1989.
