Transactions on Information and Communications Technologies vol 3, 1993 WIT Press, ISSN

Applications of Supercomputers in Engineering

A low cost, transputer based visual display processor

G.J. Porter, B. Singh, S.K. Barton
Department of Electronic and Electrical Engineering, University of Bradford, Richmond Road, Bradford, West Yorkshire, UK

ABSTRACT

As major computer systems, and in particular parallel computing resources, become more accessible to the engineer, it is becoming necessary to increase the performance of the attached video display devices at least pro-rata with that of the computational elements. To this end a number of device manufacturers have developed, or are in the process of developing, new display controllers dedicated to improving the display environments of the supercomputers used by engineers today. There is however a price to be paid for this development, both in monetary cost and in the design effort involved in integrating the new technology into existing systems. This paper presents a solution to these problems: firstly, by showing how to utilise available transputer technology to upgrade the display capability of existing systems by transferring some of the available processing power from the computational elements of the system to the display controller; secondly, by utilising a low cost, off-the-shelf Transputer Module based video display unit. The new video display processor utilises a three transputer pipeline formed from a Scan Converter Unit, a Span Encoder Unit and a Span Filler Unit, and can be further expanded by increasing the functionality and/or parallelism of each stage if required.

1. INTRODUCTION

The continued expansion in the application of computer graphics to all aspects of the working life of engineers has placed ever increasing demands on display technology. Higher resolution and faster refresh rates are constantly required for new application programs, graphic environments such as X Windows and MS Windows, and user written programs. It is not possible to continually upgrade the graphics cards and display devices utilised in computer systems, which leads to a display system that is less capable than that required by engineers.

As part of the BRAD3D real-time image generation program, a system was developed that allows existing low cost, relatively slow display cards to be used in a high speed display unit, called a Visual Display Processor (VDP). This unit utilises the existing display device as a simple back-end to the new system, performing all the necessary pre-processing in a two stage intelligent front-end and leaving the slower display card with the simple task of drawing the actual horizontal spans into the display memory.

1.1 The Brad3D Project

The BRAD3D project began at the University of Bradford in 1983, with the development of the first real-time image generation system[1]. This was constructed from two interconnected Motorola MC68000 (later MC68020) processors and a custom VDP system. The resolution of the unit was 256 by 256 pixels and it could, in 1987, manipulate a 3-D image constructed from 400 vertices and 160 polygons in realistic real-time (40 ms frame interval). This was later improved by the addition of the XTAR[2] Graphics Co-processor and a four processor network, which increased the available resolution to 512 by 512 pixels and the processing rate to over 200 polygons in a 20 ms frame interval. This was an eight fold increase over the previous system.

Further improvements were made to the structure of BRAD3D, and in 1991 a bit-slice processor was added to remove some of the computation from the Motorola MC68020 devices. This improved the throughput of the system but also highlighted its problems of extensibility and flexibility.
The system was seen as being inflexible and difficult to expand past its current size due to the fixed architecture of the dual VMEbus system employed. This instigated work on the current Transputer based system, described in greater detail in another contribution to this conference, of which the VDP described in this paper is the last stage in the processing pipeline.

1.2 The Inmos Transputer

The Inmos Transputer[3] is really a family of microprocessors, each sharing three basic facilities: a fast processor unit, on-chip memory and the ability to communicate bi-directionally with four other transputers. The transputer variant utilised in this project is the T800, which has both an integer (32 bit) and a floating point (64 bit) processor unit, giving it a sustained performance in excess of 1 Mflop at 20 MHz. To support this high computation rate the communication structure, via the serial links[4], can support a data rate of up to 2.3 Mbytes per second per link. This, together with the 4 Kbytes of on-chip memory and the availability of a hardware scheduler with two priority levels, makes the T800 a useful processor on which to develop parallel programs. The structure of the T800 is illustrated in figure 1.1.

[Figure 1.1 The Inmos T800 Transputer. Blocks shown: floating point processor, integer processor, system services, 4 Kbyte SRAM, timer, links 0 to 3, event, external memory interface, 32 bit internal highway.]

1.3 The NT1000 Transputer System

The Niche NT1000[5] is a multi-user transputer facility hosted by a SUN workstation. It is divided into four sites, each site being further divided into eight slots. One slot can accommodate a size one Transputer Module (Tram). An individual user has direct access to a single site, with up to eight processors if size one Trams are used. Alternatively, a user may utilise two or more sites, giving access to a maximum of thirty two processors. Sites may not be split between users.

Communication between processors within the same site is handled by a single IMS C004 Link Switch, as shown in figure 1.2. Each slot has two links, numbers 0 and 3, which can be configured to communicate with other similar slots within the same site. Alternatively, by using the Central Link Switch, a limited number of connections can be established between slots on different sites. Thus, most interprocessor communication paths can be established in this system.

[Figure 1.2 Niche NT1000 Transputer System. Four sites, each with slots 0 to 7, and a VMEbus interface.]

A single chain, or path, also exists between the processors in each site. This connects link 2 of each slot to link 1 of the next slot. The slot that connects to the host machine uses its link 1 to establish this communication, and the slot that connects to the off-board interface uses its link 2.

1.4 Inmos B419 Graphics TRAM

The IMS B419[6] is a graphics Tram, based on a single T800 transputer and a G300 Graphics Controller[7]. The structure of the unit is shown in figure 1.3. The transputer has access to two areas of memory, one to hold program and data and the second to act as a Frame Buffer for the image to be displayed. Both of these arrays of memory are 2 Mbytes in size and are accessed on word boundaries in the transputer's memory map. The G300 has access to the Frame Buffer memory via its Pixel Port. The Frame Buffer is constructed from Video RAMs, which have a high speed serial port that allows access to the memory at rates sufficient to drive a monitor at a horizontal resolution of 1024 pixels.

[Figure 1.3 Inmos B419 Graphics Tram. Labels shown: Link 0..3, SYNC.]

Initial investigations of the B419 showed that clearing a screen of 640 by 480 pixels took nearly 37 ms when using the supplied CGI graphics libraries[8]. This was reduced to approximately 20 ms when coded directly in assembly language and utilising the fast block MOVE[9] instructions. Upon examining the timings of the dynamic RAMs used for the video memory, it could be seen that the 20 ms performance figure was extremely close to the maximum obtainable, due to the restricted memory bandwidth.

The results of the initial tests showed clearly that it was not possible to use the standard draw to Frame Buffer approach of clearing the buffer to a background attribute and then writing each polygon to the memory in the order in which it was scan converted. Normally, when this type of Painter's algorithm is employed, the time to clear the video memory to a background attribute represents a small percentage of the available frame time, nominally 10-20%. In this case it was almost 60% of the frame time. Consequently, the three processor system described in this paper was developed in an attempt to overcome the shortcomings in the design of the B419 graphics Tram.

1.5 Overview

The VDP developed to overcome the problems defined in section 1.4 above is divided into three functional processes, each mapped onto a single transputer. They form a single pipeline as shown in figure 1.4, with the vertices entering at one end and the RGB signal to drive the monitor exiting at the other. The communication structure between these processes is described in section 2 below, and allows a fully decoupled system that takes full advantage of the transputer's parallel communication-computation capabilities.

[Figure 1.4 Video Display Processor Pipeline Structure. Vertices -> Scan Converter -> spans -> Runlength Encoder -> runlengths -> Span Filler -> RGB & sync.]

2. COMMUNICATIONS

The communication structure of any multiprocessor system has been shown to be of paramount importance[10] to the efficient utilisation of the computing resources. The system trade-offs of computation to communication time led to the design of a concurrent communication system in the Inmos transputer, the Link engine. The four Link engines are capable of transferring data to and from the processor's memory concurrently with the processor's activity, whilst stealing little memory bandwidth, and performing at up to 20 Mbits per second.

[Figure 2.1 VDP Communication Structure]

To fully utilise the concurrent communication capabilities of the transputer it is necessary to have communication service processes executing concurrently with the main computation process. These are responsible for transferring new data into the processor's memory and passing the results of the previous computations from the processor's memory to the next processor in the chain. Therefore, during any single frame period, a new set of data is being read into a processor, calculations are being performed on the previous set and the results of the last calculations are being written to the next processor in the pipeline.

The communication structure employed for the VDP can be seen in figure 2.1. The processes are indicated by circles, the processors by squares and the channels by arrows. The communication processes are shown as 'comm' and the three functional units as: SC - Scan Converter, RL - Span (Runlength) Encoder and SF - Span Filler.

Communication between the communication processes and the functional units on each transputer is achieved via the channels shown in the figure and via double buffered memory arrays. The arrays are omitted from figure 2.1 for clarity, but can be seen on the enlarged figure of each of the three transputers in the following sections. The double buffered array system works by giving each process access to only one of the arrays and restricting that access to either read or write. Synchronisation of this access is always controlled by the functional unit, which uses the communication channel to synchronise with the communication process and to pass the start address of its portion of the buffer.

For example, on the Scan Conversion processor a double buffered memory exists between the scan conversion process and the communication process that reads vertices into the processor. At the start of each cycle, the Scan Conversion process writes via the channel to the communication process and indicates the portion of the buffer to which the communication process has write access. The communication process is then responsible for the transfer of the vertices over the link and for writing them into the memory buffer. At the same time, the Scan Conversion process reads from the other half of the buffered memory and performs its calculations. When the Scan Conversion process has completed its calculations, it again synchronises with the communication process via the channel and passes to it the next address to which it has write access. The Scan Conversion process can then process the new set of data just read by the communication process.
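The double buffered exchange described above can be sketched in C. This is only a minimal, single-threaded simulation and not the authors' transputer code: the names `DoubleBuffer`, `comm_fill`, `scan_consume`, `db_swap` and `run_frames` are hypothetical, the channel handshake is reduced to toggling an index, and the two processes run one after the other rather than concurrently.

```c
#define BUF_WORDS 4   /* illustrative buffer size, not taken from the paper */

/* Two halves: one is written by the communication process while the
   other is read by the functional (computation) process. */
typedef struct {
    int half[2][BUF_WORDS];
    int write_half;            /* half currently owned by the comm process */
} DoubleBuffer;

/* Stand-in for the comm process: fill the write half with frame data. */
void comm_fill(DoubleBuffer *db, int frame)
{
    for (int i = 0; i < BUF_WORDS; i++)
        db->half[db->write_half][i] = frame * 10 + i;
}

/* Stand-in for the functional unit: sum the half it is allowed to read. */
int scan_consume(const DoubleBuffer *db)
{
    int sum = 0;
    for (int i = 0; i < BUF_WORDS; i++)
        sum += db->half[1 - db->write_half][i];
    return sum;
}

/* Synchronisation point: the functional unit hands the half it has
   finished reading back to the comm process for writing. */
void db_swap(DoubleBuffer *db)
{
    db->write_half = 1 - db->write_half;
}

/* Run n frames; results[f] receives the sum computed for frame f. */
void run_frames(int n, int *results)
{
    DoubleBuffer db = { .write_half = 0 };
    comm_fill(&db, 0);                 /* prime the pipeline with frame 0 */
    for (int f = 0; f < n; f++) {
        db_swap(&db);                  /* frame f becomes readable */
        if (f + 1 < n)
            comm_fill(&db, f + 1);     /* next frame is read in "meanwhile" */
        results[f] = scan_consume(&db);
    }
}
```

In this sketch the reader never touches the half being written, which is the property the read/write restriction in the text is meant to guarantee; on the real hardware the fill and the computation would overlap in time.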
To ensure that communication does not have to wait for computation, the communication processes are assigned a high priority level and the functional processes, which mainly perform calculation, a low priority level. This ensures that the whole system is not slowed by a single communication being delayed until a calculation is complete.

3. SCAN CONVERSION UNIT

The images that are transferred to the Scan Conversion unit, via the communication structure shown in figure 3.1, are described as a series of polygons, located in 2-D space by the {x,y} coordinates of their vertices. It is the function of the Scan Conversion unit to convert this description of the polygons into a less accurate, but more complete, description based on the scan-lines of the frame buffer memory. That is, the polygon is converted into a list of {x0,x1} spans, one for each scan-line (y coordinate) of the frame buffer that overlaps with the {y} extent of the polygon. Such a mapping for a simple polygon is shown in figure 3.2.

[Figure 3.1 Scan Conversion Processor Structure]

The mapping of the vertices to a list of span coordinates is considered to be less accurate due to the limited resolution of the frame buffer. Typical frame buffers have horizontal resolutions of 512 or 1024 pixels, whilst a point on an edge of a polygon, defined by two vertices, has an infinite resolution. Therefore, some compromises have to be reached when mapping the polygon edge onto the frame buffer, leading to effects such as the staircase image seen when lines are drawn on low resolution displays.

[Figure 3.2 Example Scan Conversion of a Simple Polygon]

3.1 Line Drawing Algorithms

To construct the list of spans, the polygon vertex list is processed using modified line drawing algorithms. These produce a list of pixel coordinates for the span edges along the line between two vertices. By using these routines on all edges of the polygon, a list of the pixels that mark the edge of the polygon in the frame buffer can be derived. Two coordinates will exist for each {y} span-line: the left and the right edges of the span in the frame buffer. This technique is illustrated for a simple polygon in figure 3.2.
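As a rough illustration of how tracing every edge yields a left and a right coordinate per span-line, the following C sketch walks each polygon edge with the basic integer Bresenham algorithm[11] and records the minimum and maximum x seen on each scan-line. It is a sketch under stated assumptions, not the implementation used in the VDP: the names `Vertex`, `Span`, `trace_edge` and `scan_convert` are hypothetical, the 480 line height is taken from the B419 screen size quoted in section 1.4, and the min/max approach is only valid for convex polygons.

```c
#include <limits.h>
#include <stdlib.h>

#define SCREEN_H 480   /* screen height quoted for the B419 in section 1.4 */

typedef struct { int x, y; } Vertex;
typedef struct { int x0, x1; } Span;   /* left and right edge on one scan-line */

/* Mark every scan-line as empty (x0 > x1). */
void spans_clear(Span spans[SCREEN_H])
{
    for (int y = 0; y < SCREEN_H; y++) {
        spans[y].x0 = INT_MAX;
        spans[y].x1 = INT_MIN;
    }
}

/* Record one edge pixel, widening the span on its scan-line. */
void span_plot(Span spans[SCREEN_H], int x, int y)
{
    if (y < 0 || y >= SCREEN_H) return;
    if (x < spans[y].x0) spans[y].x0 = x;
    if (x > spans[y].x1) spans[y].x1 = x;
}

/* Integer Bresenham walk from (xa,ya) to (xb,yb), valid in all octants. */
void trace_edge(Span spans[SCREEN_H], int xa, int ya, int xb, int yb)
{
    int dx = abs(xb - xa), sx = xa < xb ? 1 : -1;
    int dy = -abs(yb - ya), sy = ya < yb ? 1 : -1;
    int err = dx + dy;                       /* single decision variable */
    for (;;) {
        span_plot(spans, xa, ya);
        if (xa == xb && ya == yb) break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; xa += sx; }   /* step in x */
        if (e2 <= dx) { err += dx; ya += sy; }   /* step in y */
    }
}

/* Scan convert a closed convex polygon of n vertices into per-line spans. */
void scan_convert(Span spans[SCREEN_H], const Vertex *v, int n)
{
    spans_clear(spans);
    for (int i = 0; i < n; i++) {
        const Vertex *a = &v[i], *b = &v[(i + 1) % n];
        trace_edge(spans, a->x, a->y, b->x, b->y);
    }
}
```

For a triangle with vertices {10,10}, {20,10} and {10,20}, this produces the span {10,20} on line 10, {10,15} on line 15 and the single pixel span {10,10} on line 20, while all other lines stay empty.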

There are three main types of line drawing algorithm commonly used: the Bresenham[11] line drawing algorithm and its derivatives, the Double Step Algorithm[12] and the Repetitive Sequence Algorithms[13]. The Bresenham algorithm uses an error term to decide whether the next pixel is above or below the given line. That is, it uses a single decision variable based on integer arithmetic. The Double Step algorithm extends this by using two decision variables to select one of four two-pixel patterns. The Repetitive Sequence algorithms attempt to split the given edge into a number of repetitions of a sequence of horizontal and vertical movements. When this sequence has been established it can simply be repeated the given number of times. However, in some cases only a single sequence exists and thus there is no saving in time from rapidly generating repeats, so the additional computation overhead is incurred for no gain in performance. For this reason the Repetitive Sequence Algorithms were not considered.

When comparing the Bresenham and Double Step algorithms, it can be seen that in all cases the Double Step algorithm is superior to the Bresenham algorithm. This is due to it generating two pixels per iteration rather than one, even with the additional computation involved. For this reason the Double Step algorithm was implemented. The results of this test are shown in figure 3.3, which illustrates the ratio between the time taken for the Bresenham routine and the Double Step.

[Figure 3.3 Comparison of Bresenham and Double Step Scan Conversion Algorithms. Ratio of Bresenham time to Double Step time plotted against line length.]

3.1.1 Modifications to Line Drawing Algorithms

One problem commonly found when using line drawing algorithms is that they will generate multiple {x,y} coordinates for each scan-line. This can best be seen when a horizontal line, shown in figure 3.4, is drawn between points {10,30} and {15,30}. This will result in the list {10,30}, {11,30}, {12,30}, {13,30}, {14,30} and {15,30}, while the single span line can clearly be described with the coordinate pairs {10,30} and {15,30}. To overcome this problem, the line drawing algorithms are modified to produce a single {x,y} coordinate for each span-line.

[Figure 3.4 Modifications to Scan Conversion Algorithms]

To optimise the scan conversion algorithm, special tests were placed at the front of the control loop to detect horizontal, vertical and 45 degree lines and to invoke special accelerated drawing routines for them.

4. SPAN ENCODER UNIT

The output of the Scan Conversion processor is a list of spans which are to be drawn in the given order into the Frame Buffer memory. However, due to the limitations in the speed of access to the Frame Buffer in the B419 Tram, as mentioned in section 1.4 above, it was necessary to encode the spans into a form more suitable for rapid writing to the Frame Buffer. This is the function of the Span Encoder Unit, otherwise known as the Runlength Encoder because of its method of encoding the spans as runlengths of their attribute.

The limits of the B419 Tram meant that the image to be written to the Frame Buffer must be a combination of the actual image plus additional polygons depicting the area of the 640 by 480 pixel screen not covered by the image, i.e. the background. The Span Encoder was therefore required to encode the background into the runlengths as well as the actual image.
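The modification described in section 3.1.1 can be illustrated with a small C sketch that classifies an edge first and then emits one x coordinate per scan-line. This is an assumed reading of the text rather than the authors' routine: `EdgeKind`, `classify_edge` and `edge_coords` are hypothetical names, a horizontal edge is represented by its two end points as in the {10,30} to {15,30} example, and the general case uses simple rounded integer interpolation instead of the Double Step algorithm.

```c
#include <stdlib.h>

typedef enum {
    EDGE_HORIZONTAL, EDGE_VERTICAL, EDGE_DIAGONAL, EDGE_GENERAL
} EdgeKind;

/* Tests placed ahead of the main loop to pick an accelerated routine. */
EdgeKind classify_edge(int xa, int ya, int xb, int yb)
{
    int dx = abs(xb - xa), dy = abs(yb - ya);
    if (dy == 0) return EDGE_HORIZONTAL;
    if (dx == 0) return EDGE_VERTICAL;
    if (dx == dy) return EDGE_DIAGONAL;      /* 45 degree line */
    return EDGE_GENERAL;
}

/* Write the edge's x coordinates into xs[], one per scan-line, ordered
   from the smaller y to the larger y. A horizontal edge yields just its
   two end points. Returns the number of entries written, or 0 if xs[]
   (capacity max) is too small. */
int edge_coords(int xa, int ya, int xb, int yb, int *xs, int max)
{
    EdgeKind kind = classify_edge(xa, ya, xb, yb);

    if (kind == EDGE_HORIZONTAL) {
        if (max < 2) return 0;
        xs[0] = xa < xb ? xa : xb;
        xs[1] = xa < xb ? xb : xa;
        return 2;
    }
    if (ya > yb) {                           /* always walk downwards */
        int t = xa; xa = xb; xb = t;
        t = ya; ya = yb; yb = t;
    }
    int dy = yb - ya, n = dy + 1;
    int adx = abs(xb - xa), sx = xb >= xa ? 1 : -1;
    if (max < n) return 0;

    for (int i = 0; i < n; i++) {
        switch (kind) {
        case EDGE_VERTICAL: xs[i] = xa; break;
        case EDGE_DIAGONAL: xs[i] = xa + sx * i; break;
        default:            /* rounded integer interpolation along the edge */
            xs[i] = xa + sx * ((2 * i * adx + dy) / (2 * dy));
            break;
        }
    }
    return n;
}
```

With this sketch the horizontal line from {10,30} to {15,30} produces only the two values 10 and 15, rather than the six pixels a plain line drawing routine would list.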

[Figure 4.1 Span Encoder Processor Structure]

4.1 Dynamic Runlength Data Structure

To enable the runlength algorithm to incorporate a variable number of spans, the exact number depending on the image displayed and on the clipping and hidden surface preprocessing utilised, a dynamic data structure was employed. This was constructed from two tables: a Line Table with one entry for each span line (y coordinate) in the display, and a Span Table with one entry for each of the spans to be displayed. At the start of each frame cycle, these tables were initialised to the state shown in figure 4.2, which illustrates how the background polygons were incorporated into the span tree. The background spans are shown as two spans of equal length per scan line due to a restriction of the Span Filler (see section 5), which limits the maximum span length to half of the display width.

[Figure 4.2 Dynamic Run-Length Data Structure. Line Table entries (Line 0 to Line 479), each linked to two background spans, 0-319 and 320-639, in the Span Table, with each list terminated by NULL.]

Spans to be included in the list are stored by the communication process in a double buffered array, see figure 4.1. They are stored strictly in the order in which they should be included in the span list, later spans having a higher hidden surface, or scan line, priority than earlier spans. Each span in the list is processed and added to the correct place in the Span Table. Existing spans are reduced or overwritten to make room for the new span in the table. To minimise the amount of data copying when a new span is inserted, spans are simply linked into the Span Table without being moved from the span list. Therefore, only a single pointer in each span needs altering to insert that span into the Span Table (not including any alterations to the spans already in the table).

When all the spans have been inserted into the Span Table, it is necessary to convert the spans into a new runlength format. In the Span Table each span is represented by a structure of the form:

    int x0;      Least significant X coordinate of the Span
    int x1;      Most significant X coordinate of the Span
    int y;       Y coordinate of the Span
    int attr;    Attribute of the Span
    SPAN *next;  Link to next Span in the Table

which needs to be converted into the runlength structure of the MOVE instruction:

    int start;   Start address of the data to be copied
    int dest;    Destination address to which the data is to be copied
    int length;  Number of bytes to copy

The actual implementation requirements of each of these entries will be described in section 5.
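The two structures above, and the initial state of the Line and Span Tables, can be sketched in C as follows. The sketch makes several assumptions that are not stated in the paper: addresses are modelled as byte offsets, the Frame Buffer is taken to be linear with one byte per pixel and 640 bytes per line, each attribute is taken to select a 320 byte row of colour data, and the names `Runlength`, `tables_init`, `span_to_runlength`, `encode_line` and the `BACKGROUND` attribute value are hypothetical. The insertion of image spans into the table, with the trimming of existing spans, is deliberately left out.

```c
#include <stddef.h>

#define SCREEN_W   640
#define SCREEN_H   480
#define HALF_W     (SCREEN_W / 2)   /* maximum span length */
#define BACKGROUND 0                /* assumed attribute for the background */

typedef struct Span {
    int x0, x1;               /* inclusive X extent of the span */
    int y;                    /* scan-line of the span */
    int attr;                 /* attribute (colour) of the span */
    struct Span *next;        /* next span on the same line, or NULL */
} Span;

typedef struct {
    int start;    /* offset of the attribute's colour data */
    int dest;     /* offset of the first pixel in the Frame Buffer */
    int length;   /* number of bytes to copy */
} Runlength;

Span *line_table[SCREEN_H];        /* one list head per scan-line */
Span background[SCREEN_H][2];      /* two half-width spans per line */

/* Initial state of figure 4.2: every line holds spans 0-319 and 320-639. */
void tables_init(void)
{
    for (int y = 0; y < SCREEN_H; y++) {
        background[y][0] = (Span){ 0, HALF_W - 1, y, BACKGROUND,
                                   &background[y][1] };
        background[y][1] = (Span){ HALF_W, SCREEN_W - 1, y, BACKGROUND,
                                   NULL };
        line_table[y] = &background[y][0];
    }
}

/* Convert one span into the three MOVE parameters. */
Runlength span_to_runlength(const Span *s)
{
    Runlength r;
    r.start  = s->attr * HALF_W;            /* assumed colour row layout */
    r.dest   = s->y * SCREEN_W + s->x0;     /* assumed linear frame buffer */
    r.length = s->x1 - s->x0 + 1;
    return r;
}

/* Encode every span on line y; returns the number of runlengths written. */
int encode_line(int y, Runlength *out, int max)
{
    int n = 0;
    for (const Span *s = line_table[y]; s != NULL && n < max; s = s->next)
        out[n++] = span_to_runlength(s);
    return n;
}
```

Because the Line Table holds only list heads, resetting a line to the background costs one pointer write per line in this sketch, which matches the intent of linking spans rather than copying them.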
4.2 Runlength Processing Optimisation

Processing a large number of spans in a short period of time requires careful consideration of the use of the processor's instructions, and a willingness to spend memory in order to save processor time. Optimisation steps carried out in the development of this system include: (i) saving the structure shown in figure 4.2 and copying it back to the Line and Span Tables for each cycle, (ii) the use of tables to hold the start address for each scan line for both halves of the frame buffer, (iii) the use of a table to hold the attribute address start values for the Logical attribute array (see section 5) and (iv) the use of transputer assembly language for certain key areas of the system.
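Optimisation (ii) trades memory for time by replacing a multiplication per span with a table lookup. A minimal C sketch of such a table is given below. It rests on an assumption about the text: "both halves of the frame buffer" is read here as two screen-sized pages stored one after the other, which the paper does not confirm. The names `line_start`, `line_start_init` and `pixel_offset` are hypothetical, and offsets are used in place of real addresses.

```c
#include <stddef.h>

#define SCREEN_W   640
#define SCREEN_H   480
#define PAGE_BYTES ((size_t)SCREEN_W * SCREEN_H)   /* one byte per pixel */

/* Start offset of every scan-line, for each of the two assumed pages. */
size_t line_start[2][SCREEN_H];

/* Build the table once, outside the per-frame loop. */
void line_start_init(void)
{
    for (int page = 0; page < 2; page++)
        for (int y = 0; y < SCREEN_H; y++)
            line_start[page][y] = page * PAGE_BYTES + (size_t)y * SCREEN_W;
}

/* Per-span work is now a lookup and an add, with no multiply. */
size_t pixel_offset(int page, int x, int y)
{
    return line_start[page][y] + (size_t)x;
}
```

The same pattern would apply to optimisation (iii), with a small table indexed by attribute instead of by scan-line.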

5. SPAN FILLER UNIT

The final stage in the display pipeline is the Span Filler Unit. This receives runlength encoded spans from the Span Encoder Unit and writes them to the Frame Buffer memory. The structure of the Span Filler Unit is shown in figure 5.1. This unit makes use of two special features of the Transputer, although similar features can be found in other processors. These features are the internal 4 Kbytes of memory, which can be accessed in a single cycle (50 ns at 20 MHz), and the block MOVE instruction. Combining these two features allows the Span Filler to rapidly fill spans in the Frame Buffer by copying from the internal memory at speeds which approach the bandwidth of the memory.

[Figure 5.1 Span Filler Processor Structure]

5.1 Logical and Physical Colours

The B419 Tram has a colour look-up table (CLUT) of 256 locations, each 24 bits wide, giving a selection of up to 16 million colours. These are classified as the Physical Colours, or attributes, of the Span Filler Unit. Mapped onto these attributes are the 8 Logical Colours that the Span Filler can utilise during any single frame. The restriction to 8 colours is due to the limited size of the internal memory of the transputer and the need to leave some memory for the workspace of the processes.

To form the Logical Colours, a portion of the internal memory is formed into an 8 by 320 (half screen width) byte wide array. The rows are then filled with the Physical Colours that the user has assigned to each of the Logical Colours, one 320 byte array for each colour. Therefore, the user can assign any Physical Colour from the 16 million that can be programmed into the CLUT to each of the Logical Colours, with the restriction that there are only eight Logical Colours.
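The Logical Colour array and its use for span filling can be sketched in portable C. This is an illustration under assumptions rather than the transputer implementation: `memcpy` stands in for the block MOVE instruction, ordinary static arrays stand in for the on-chip memory and the video RAM, the Frame Buffer is assumed to be linear with one byte per pixel, and the function names `set_logical_colour` and `fill_span` are hypothetical.

```c
#include <string.h>

#define SCREEN_W        640
#define SCREEN_H        480
#define HALF_W          320   /* half screen width: maximum span length */
#define LOGICAL_COLOURS 8

/* 8 rows of 320 bytes = 2560 bytes, which leaves room in a 4 Kbyte
   internal memory for process workspace. */
unsigned char logical[LOGICAL_COLOURS][HALF_W];

/* Stand-in for the Frame Buffer, one CLUT index per pixel. */
unsigned char frame_buffer[SCREEN_W * SCREEN_H];

/* Assign a Physical Colour (CLUT index) to a Logical Colour by filling
   that colour's row with the index. */
void set_logical_colour(int logical_index, unsigned char physical)
{
    memset(logical[logical_index], physical, HALF_W);
}

/* Fill a span of 'length' pixels (1..320) starting at (x, y) by block
   copying from the start of the colour's row. */
void fill_span(int logical_index, int x, int y, int length)
{
    memcpy(&frame_buffer[y * SCREEN_W + x],
           logical[logical_index], (size_t)length);
}
```

Since every byte in a row is identical, a span of any length up to 320 can be copied from the start of the row, which is why a span never needs to be longer than one row and why the background is split into two spans per line.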

5.2 Operation of the Span Filler

The Span Filler takes the list of runlengths from the Span Encoder Unit and, using the MOVE instruction, copies colour information from the internal memory to the Frame Buffer. For this reason the three parameters that make up the runlength are loaded directly into the three registers A, B and C of the transputer, and can be seen as:

    int start;   Attribute array address in the internal memory
    int dest;    Pixel address in the Frame Buffer
    int length;  Number of bytes to copy, 0 < length <= 320

Thus to fill each span, the Span Filler needs only to carry out three loads and one move instruction; the rest is done automatically by the hardware of the Transputer.

6. PERFORMANCE ANALYSIS

The analysis of the performance of any graphics system is extremely difficult, as by the nature of their use graphics systems are totally data dependent. However, in an attempt to gain some quantitative feel for the performance of the VDP, an analysis was made of the one critical part of the system, the Span Filler Unit. This was considered critical, as it was the one part of the system that was fixed in the level of performance it could provide.

6.1 Mathematical Analysis

From examination of the short piece of transputer assembly language that carries out the actual span filler function, equation 1, for the time to draw the whole screen in the Frame Buffer, was developed.

$$ T_{screen} = N \left[ T_{movesetup} + P \, T_{wpixel} \right] \qquad (1) $$

Where $N$ is the number of spans, $T_{movesetup}$ is the time to set up the move instruction, $P$ is the number of pixels per span and $T_{wpixel}$ is the time to write a single pixel to the Frame Buffer. With the knowledge that a single span line is 640 pixels wide and that 480 span lines make up a screen, so that $N = (480 \times 640)/P$, this can be reformatted into the structure of equation 2 below.

$$ T_{screen} = \frac{480 \times 640}{P} \left[ T_{movesetup} + P \, T_{wpixel} \right] \qquad (2) $$

6.2 Experimental Analysis

To measure the accuracy of equation 2, a series of span lists were written to the Span Filler with values of P between 5 and 320. The results are shown in figure 6.1. Curve fitting equation 2 to these results and minimising, gives equation 3 shown below.

    0.0188P + 0.9759    (3)

Both the analytical and the experimental results show that, provided the average span size is above 64 pixels, the system will operate within 2 refresh cycles per frame (34 ms) and will be in effective real-time.

[Figure 6.1 Timing Analysis of Span Filler on B419 Tram. B419 timing data and curve fit to the data, plotted against P (number of pixels per span).]

7. SUMMARY

This paper has described a novel method of expanding the capabilities of existing low cost, but slow, graphics facilities with the addition of inexpensive preprocessing systems. An analysis of the performance of the critical part of the system, together with experimental results, has also been presented.

REFERENCES

1. Serra del Molino, L. 'A Multiprocessor System For Real-Time Image Generation', PhD Thesis, Information System Engineering, University of Bradford, 1987.
2. Colman, T. and Powers, S. 'The Xtar Graphics Microprocessor', Byte, November 1984, pp. 179-186.
3. Dettmer, R. 'The affordable gigaflop', IEE Review, March 1988, pp. 123-126.
4. Taylor, R. 'Transputer communication link', Microprocessors and Microsystems, Vol. 10, No. 4, May 1986.
5. Niche Ltd. 'The NT1000 Hardware Manual'.
6. Inmos Ltd. 'IMS B419-4 Graphics Tram user manual', October 1990.
7. Inmos Ltd. 'The Graphics databook', Second edition, 1990.
8. Inmos Ltd. 'The IMS F003a-c CGI Graphics Libraries Manual', 1990.
9. Inmos Ltd. 'A Compiler Writers Guide', Prentice Hall, 1987.
10. Milligan, P., Scott, N.S., Crookes, D., Kilpatrick, P.L. and Morrow, P.J. 'Network Topology: A Critical Factor in the Implementation of Algorithms Intended for Efficient Execution on a Transputer Network', Microprocessing and Microprogramming, Vol. 23, 1988, pp. 253-258.
11. Bresenham, J.E. 'Algorithm for computer control of a digital plotter', IBM Systems Journal, Vol. 4, No. 1, 1965, pp. 25-30.
12. Wu, X. and Rokne, J.G. 'Double-step incremental generation of lines and circles', Computer Vision, Graphics and Image Processing, Vol. 37, March 1987, pp. 331-344.
13. Rokne, J.G., Wyvill, B. and Wu, X. 'Fast line scan-conversion', ACM Transactions on Graphics, Vol. 9, No. 4, October 1990, pp. 376-388.