PART 2: THE SOFTWARE
BYTE • SEPTEMBER 1985
Trevor G. Marshall, George Scolaro, David L. Rand, and Tom King are engineers with Definicon Systems Inc. Vincent P. Williams is president of Definicon. They can be contacted at 21042 Vintage St., Chatsworth, CA 91311.
Plug a 32-bit microcomputer into your IBM PC
Editor's note: This is part 2 of a two-part series describing Definicon Systems Inc.'s DSI-32 32032 coprocessor board for the IBM Personal Computer (PC). Part 1 included a description of the hardware, and part 2 will focus on the software available for the board. If you are interested in a closer look at National Semiconductor's 32000 series, refer to "The National Semiconductor NS16000 Microprocessor Family" by Glenn Leedy, April 1983 BYTE, page 53. You should keep in mind that the 32000-series chips are direct descendants of the 16000-series chips, and that the design philosophies and internal operation of both families are identical.
Definicon has developed an unusual software interface for the DSI-32. The traditional software environment for a 32-bit computer has been some version of UNIX; instead, Definicon opted for a "shell" program and compilers that emulate the important UNIX calls. This lets the DSI-32 run both UNIX applications and existing MS-DOS applications. An additional advantage is that the 32032 I/O (input/output) operations are off-loaded to the 8088, which performs the file and I/O processing concurrently.
This interface consists of two programs running in tandem. On the 32032 side is a small operating system called 3210. On the 8088 side is a program called LOADer, and the two programs talk via an interprocessor communications area in the DSI-32's memory (see figure 1). When the DSI-32 first starts, bootstrap code executes 3210, which waits until the 8088 has loaded the 32032 program into the DSI-32's memory. Once the program load is complete, LOADer tells 3210 to execute the loaded program. Thereafter, if the 32032 program needs to open a file, write a character to the screen, or perform any other I/O operation, it passes that request to 3210, which in turn passes it to the 8088.
The software interface uses the features of both the 8088 and 32032 environments. LOADer, written in C, runs under MS-DOS (or Concurrent PC DOS). Since it manages most of the file I/O, it is relatively complex (about 35K bytes of code). The 3210 module handles most of the computationally intensive tasks. Together they form the interface detailed in table 1.
A program running on the DSI-32 makes I/O requests to the 3210 program via the 32032's SVC (supervisor call) instruction. When an SVC instruction is executed, it initiates a supervisor call trap. This causes the 32032 to vector to a known location; it acts like a shorthand subroutine call. (The SVC instruction is directly analogous to the INT software interrupt instruction on the 8088/8086 processors.) To execute any of the 3210 program's service requests (as shown in table 1), you simply load the 32032's R0 register with a request number and execute an SVC instruction. (See the April 1983 article mentioned previously for more information on the 32032's components.)
The interaction between the 32032 and the 8088 occurs at two levels. Some primitive 8088 functions, such as reading and writing blocks of memory or reading and writing I/O ports, have been made available to 32032 programs, but the majority of the interface uses the MS-DOS device-driver concept.
Most users are familiar with MS-DOS devices such as CON and COM1, but in addition to these basic system facilities, MS-DOS allows the installation of special device drivers, such as .SYS files, at boot time. For example, if a GPIB (general-purpose interface bus) card has been installed in this way, a user program could define a driver for a device called GPIB:, open it in the same way as opening any other file, and access the GPIB card without any need to know about special memory locations, I/O ports, or other system-dependent parameters. UNIX goes a step further, treating even CON: as the standard input device, which is labeled <STDIN>, so that all computer interfaces are accessed through the common file system. The Definicon MS-DOS interface has been designed to be similar to UNIX in this respect.
The supervisor request to open a file (see request 5, table 1) opens the device whose name has been passed to it and returns a "handle," which is a 16-bit number assigned by the MS-DOS operating system to every file or device that a program has opened. Subsequently, when the program performs an I/O operation, the program informs the operating system where that I/O is to occur by passing it a handle. Detailed descriptions of handles can be found in most technical reference manuals for MS-DOS version 2.xx or higher.
Some handles are predefined and you need not explicitly open them to use them. For example, the handle for <STDIN>, which is usually the keyboard, is 0. The handle for the standard output device, <STDOUT>, usually the CRT (cathode-ray tube), is 1, and the standard error device, <STDERR>, is 2. You may redirect all these channels if you desire.
Thus, to write a character to the CRT using the interface, you would write in C:
    putc(char)
or in assembler language:
    MOVB 9,R0       ; Supervisor request 9, write to device
    MOVW 1,R1       ; Handle of <STDOUT> is 1
    ADDR char,R2    ; Point at char to "put"
    MOVW 1,R3       ; Output only 1 character
    SVC             ; Do supervisor call
The remaining supervisor calls may require fewer parameters, but they are accessed similarly.
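The register convention in the assembler fragment above can be modeled in C. The following is only a sketch: the names svc_regs, svc, and put_char are hypothetical, the "screen" is an in-memory buffer rather than a real CRT, and only request 9 on handle 1 is modeled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the registers a program loads before SVC. */
struct svc_regs {
    unsigned    r0;   /* supervisor request number (see table 1) */
    unsigned    r1;   /* handle */
    const char *r2;   /* address of the data */
    unsigned    r3;   /* number of bytes */
};

enum { REQ_WRITE = 9, HANDLE_STDOUT = 1, SCREEN_MAX = 256 };

static char   screen[SCREEN_MAX];   /* stand-in for the CRT */
static size_t screen_len;

/* Returns the number of bytes written, or -1 for anything this
   sketch does not model (other requests, other handles, overflow). */
int svc(const struct svc_regs *regs)
{
    if (regs->r0 != REQ_WRITE || regs->r1 != HANDLE_STDOUT)
        return -1;
    if (regs->r3 > SCREEN_MAX - screen_len)
        return -1;
    memcpy(screen + screen_len, regs->r2, regs->r3);
    screen_len += regs->r3;
    return (int)regs->r3;
}

/* Mirrors the five-instruction sequence: request 9, handle 1,
   pointer to the character, count of 1, then the supervisor call. */
int put_char(char c)
{
    struct svc_regs regs = { REQ_WRITE, HANDLE_STDOUT, &c, 1 };
    return svc(&regs);
}
```

Calling put_char('A') returns 1 and appends one byte to the buffer; a request number the sketch does not handle returns -1.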
All interfacing between the 8088 and 32032 takes place in the first page of memory, or physical addresses 002000 through 002FFF (hexadecimal) as seen by the 32032.
[Editor's note: All addresses appear in hexadecimal unless specified otherwise.]
Figure 2 lists the significant locations within that memory area. Two important communications areas in figure 2 are the service request word (SRW) and the disk service request word (DSRW).
There is a 32032 SRW and an 8088 SRW. As shown in figure 3, these words act as mailboxes, holding messages that the two processors send to one another. For example, when the 32032 wants input from the keyboard, it first checks the 8088 SRW. If the SRW is clear (meaning no action is pending), the 32032 sets it to 1 and issues an interrupt to the 8088. The 32032 can then go do some other task, or just wait in a software loop for the 8088 to service the keyboard-input request.
The 8088, upon getting the 32032's interrupt, checks its own SRW and discovers that it has been set to 1. The 8088 then places the next character typed at the keyboard into the keyboard queue, sets the 32032 SRW to 1, clears its own SRW, and sends an interrupt back to the 32032. When the 32032 processes this interrupt, it clears the 32032 SRW, fetches the character from the keyboard queue, and continues with the main program. (Normally, the 8088 puts all characters in the keyboard queue anyway, even if the 32032 does not specifically ask for them.)
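The keyboard handshake just described can be traced with a small single-threaded simulation. This is a sketch under stated assumptions: the structure mailbox and the functions request_key, service_8088, and fetch_key are hypothetical names, interrupts are replaced by ordinary function calls, and the queue is a simple ring buffer rather than the actual queue-pointer layout.

```c
#include <stdint.h>

enum { SRW_CLEAR = 0, SRW_KEYBOARD = 1, KEY_QUEUE_SIZE = 64 };

/* Hypothetical stand-in for the shared communications area. */
struct mailbox {
    uint16_t srw_8088;     /* request waiting for the 8088 */
    uint16_t srw_32032;    /* notification waiting for the 32032 */
    char     key_queue[KEY_QUEUE_SIZE];
    unsigned head, tail;   /* consumer and producer positions */
};

/* 32032 side: post a keyboard request if none is pending.
   Returns 1 if posted, 0 if the 8088 SRW was still busy. */
int request_key(struct mailbox *m)
{
    if (m->srw_8088 != SRW_CLEAR)
        return 0;
    m->srw_8088 = SRW_KEYBOARD;   /* hardware would now interrupt the 8088 */
    return 1;
}

/* 8088 side: queue the typed character, notify the 32032,
   then clear its own SRW. */
void service_8088(struct mailbox *m, char typed)
{
    if (m->srw_8088 != SRW_KEYBOARD)
        return;
    m->key_queue[m->tail % KEY_QUEUE_SIZE] = typed;
    m->tail++;
    m->srw_32032 = SRW_KEYBOARD;
    m->srw_8088 = SRW_CLEAR;      /* hardware would now interrupt the 32032 */
}

/* 32032 side: acknowledge the notification and fetch the character.
   Returns the character, or -1 if nothing is waiting. */
int fetch_key(struct mailbox *m)
{
    char c;
    if (m->srw_32032 != SRW_KEYBOARD || m->head == m->tail)
        return -1;
    m->srw_32032 = SRW_CLEAR;
    c = m->key_queue[m->head % KEY_QUEUE_SIZE];
    m->head++;
    return (unsigned char)c;
}
```

A second request_key call made before the 8088 has serviced the first returns 0, which corresponds to the 32032 finding the 8088 SRW still set.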
There is only one DSRW and it is a little more complicated than the SRWs (see figure 4). Basically, it holds a code number corresponding to the particular I/O operation that the 32032 needs the 8088 to perform on a handle, which typically corresponds to a disk file.
System call 17 (see table 1) moves memory from the 32032 to the IBM. This allows you to set up single or multiple 16K-byte graphics screen images in the 32032's memory space and swap them into the IBM PC's screen memory. It takes only a few tens of milliseconds to totally replace a PC's screen in this way. Thus, an application that, for example, requires animation, can use the 32032 to set up successive backgrounds in its own memory space, swapping only those portions of the screen memory actually necessary to update the foreground motion. When a background change is needed, it can be transferred to the display in a fraction of a second. Note that it would be possible to set up a large background space from which the 16K-byte screen display could be "zoomed" using the 32-bit integer manipulation capability of the 32032. Applications that have already been written (in a high-level language) can perform bit-mapped graphics with nearly identical code in the DSI-32 environment.
Early in the DSI-32 design, Definicon benchmarked a variety of hardware/software combinations. These benchmarks revealed huge speed variations that could only be attributed to compiler efficiency. For example, there was a large speed difference between the various implementations of the UNIX portable compilers, a difference that seemed to depend on the porting efficiency that the particular company was able to achieve.
To better evaluate compiler efficiency, Definicon examined the intermediate assembler source-code outputs and was able to divide the differences into the following two categories:
degradations in floating-point performance due to a compiler's inability to produce in-line source code.
degradations due to the inability of the compiler to optimize its output code.
Lack of in-line code often leads to a call to a subroutine every time a mathematical operation is performed. With in-line code, the compiler emits the instructions for a mathematical operation, such as multiplication, directly at each point where it encounters the operation. With the compact instruction set of the 32000 series, this reduces several lines of code to just one or two.
Lack of optimization is easily recognized. For instance, when compiling a benchmark that Definicon had devised to test floating-point operations, the UNIX portable C compiler produced the following code within the Z = Y * X execution loop:
    MOVF _X,F0    ;get X from memory to floating register
    MULF _Y,F0    ;multiply it by Y
    MOVF F0,_Z    ;store the result in Z (in memory)
The Green Hills compilers shifted X and Y into floating-point registers outside the loop and then performed the much faster sequence:
    MULF F0,F2    ;multiply where it stored X and Y
    MOVF F2,F4    ;and store it where it wants Z
(It was interesting to note that the VAX compiler optimizers were so efficient that they removed any operations that produced results that were not used later in the program. This caused a great deal of trouble until Definicon devised benchmarks to force the generation of real code.)
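One common way to keep an optimizer from discarding benchmark work is to make every iteration depend on the previous one and to return the final value to the caller. The sketch below illustrates that idea only; it is not Definicon's benchmark, and the name multiply_chain is hypothetical.

```c
/* Multiply loop whose result cannot be removed as dead code:
   each iteration uses the previous product, and the final
   product is returned to the caller. */
double multiply_chain(double x, double start, long iterations)
{
    double z = start;
    long i;

    for (i = 0; i < iterations; i++)
        z = z * x;        /* depends on the value from the last pass */
    return z;             /* the result is observable, so it must be computed */
}
```

For example, multiply_chain(2.0, 1.0, 10) performs ten dependent multiplications and yields 1024.0. Timing such a call, and printing or otherwise using the returned value, gives the compiler no unused results to eliminate.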
Several other differences between compilers were also evident. The 32032 has a "move quick" instruction for immediate operands in the range of -8 to +7. For example,
MOVQD 6,R0
is only a 2-byte instruction that moves 6 into register R0 as a 32-bit ("double") quantity. If the operand is out of this range, most other compilers generate:
MOVD 34,R0,
which takes longer to execute than the previous instruction because the 34 is stored as a 32-bit, immediate double word.
The Green Hills compilers, however, generate
ADDR @34,R0
which tells the CPU (central-processing unit) to calculate the effective address of 34, which is of course 34, and place the resulting 32 bits in R0. This construct requires only a 1-byte field for the immediate value and therefore executes faster.
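The choice among the three forms can be written as a simple range test. This sketch assumes the quick field is a 4-bit two's-complement value (-8 through +7) and that a one-byte displacement covers -64 through +63; both ranges are stated from memory of the NS32000 encoding and should be checked against the programmer's reference. The enumeration and function names are hypothetical, and a real compiler would also consider two-byte displacements.

```c
/* Hypothetical labels for the three instruction forms in the text. */
enum load_form {
    FORM_MOVQD,        /* MOVQD n,R0: value held in the quick field   */
    FORM_ADDR_DISP8,   /* ADDR @n,R0: value held as a 1-byte displacement */
    FORM_MOVD_IMM32    /* MOVD n,R0:  value held as a 32-bit immediate */
};

/* Pick the shortest form able to hold the constant. */
enum load_form choose_load_form(long value)
{
    if (value >= -8 && value <= 7)       /* assumed quick range */
        return FORM_MOVQD;
    if (value >= -64 && value <= 63)     /* assumed 1-byte displacement range */
        return FORM_ADDR_DISP8;
    return FORM_MOVD_IMM32;
}
```

With these assumptions, the constant 6 selects the quick form and 34 selects the ADDR form, matching the two examples in the text.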
Consequently, Definicon selected the Green Hills C, Pascal, and FORTRAN compilers for the DSI-32. Since these compilers were written in Pascal, Definicon attempted to port them by compiling them in an 8088 Pascal. This did not work because of segmentation constraints on the 8088 (see the text box "The Need for Speed" on page 124), so Definicon wrote an interface package that was used to port the Pascal compiler to the DSI-32. When Pascal was running successfully on the board, it was used to bring up the C and FORTRAN compilers.
FORTH-83 defines integers and addresses clearly as 16-bit words. This severely limits FORTH's usefulness in a full 32-bit environment. Because each memory access on the DSI-32 (including, of course, stack accesses) is 32 bits wide, it makes no sense for a 32032 FORTH implementation to observe these restrictions.
A 32032 FORTH is available for the DSI-32 board from Symbolic Processing Systems of Orange, California. Symbolic Processing Systems' FORTH defines all integers and addresses as 32 bits wide. In addition, the storage order of bytes in memory has been changed to match that of the 32032. [Editor's note: The 32032 stores all multibyte quantities with the least significant byte at the lowest address.] This version includes the FORTH editor, string and file support, together with all floating-point operations.
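The byte order mentioned in the editor's note can be made concrete with two helper routines that store and load a 32-bit value with the least significant byte at the lowest address. The names store_le32 and load_le32 are hypothetical; the code works the same way on any host, whatever its native byte order.

```c
#include <stdint.h>

/* Store a 32-bit value with the least significant byte first. */
void store_le32(uint8_t out[4], uint32_t value)
{
    out[0] = (uint8_t)(value & 0xFF);
    out[1] = (uint8_t)((value >> 8) & 0xFF);
    out[2] = (uint8_t)((value >> 16) & 0xFF);
    out[3] = (uint8_t)((value >> 24) & 0xFF);
}

/* Rebuild the 32-bit value from four bytes stored that way. */
uint32_t load_le32(const uint8_t in[4])
{
    return (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}
```

Storing 12345678 (hexadecimal) places 78 at the lowest address and 12 at the highest.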
Serious FORTH programmers will be interested in the upgraded FORTH package, complete with an assembler, debugging aids, the metacompiler, and FORTH source screens. The metacompiler allows recreation of a customized FORTH from its source screens. The debugging tools, including TRACE and VIEW facilities and a decompiler, provide a full-featured 32-bit FORTH programming environment. In addition, this FORTH is multitasking, with the number of tasks limited only by the available memory space. (See the text box on page 134 of the August BYTE for price information.)
Additionally, a small LISP interpreter and a small BASIC compiler, both written in FORTH, are available for the DSI-32.
Coupled with the UNIX-like shell and high-level language compilers, the DSI-32 board performs on a level previously occupied by expensive minicomputers. It offers IBM PC owners 32-bit computing, an upgrade that doesn't cause the usual problem of equipment obsolescence. A FORTH interpreter is available for $299 from Symbolic Processing Systems, 501 West Maple, Orange, CA 92668, (714) 637-4298.
The authors are indebted to Martin A. Lewis of Cambrian Consultants Inc. of Calabasas, California, for his help and guidance, and to Les Wilson, an applications engineer at National Semiconductor, for his untiring assistance.
Address | Contents | Region
000000 | BOOTSTRAP | RAM SPACE
000010 | RESERVED |
000020 | MODULE TABLE AREA |
002000 | INTERPROCESSOR COMMUNICATION AREA |
003000 | 3210 (32032 I/O KERNEL) |
004000 | RAM |
100000 | FREE |
200000 | |
300000 | |
400000 | |
500000 | |
600000 | |
700000 | |
800000 | |
900000 | |
A00000 | |
B00000 | |
C00000 | |
D00000 | |
E00000 | DIAG VECTOR PAL | I/O SPACE
E00080 | EXPANSION PORT (GPIB) |
E00100 | DUART |
E00180 | |
F00000 | FREE |
Figure 1: The DSI-32's memory map.
3210 is the I/O kernel that resides between locations 3000 and 4000.
All addresses are in hexadecimal.
Table 1: Functions offered by the 3210 program.
Each supervisor request is called by placing the request number in the 32032's R0 register and executing an SVC instruction. 3210 passes requests to LOADer (running on the 8088), which accesses the actual device. Note that many supervisor requests require additional information in other registers.
Supervisor Request | Function
5 | Open a file or device for I/O
6 | Close a file or device
7 | Create a file
8 | Read from a file or device (up to 65536 bytes)
9 | Write to a file or device (up to 65536 bytes)
10 | Erase a file
11 | Rename a file
12 | Seek to a byte in a file (The offset to seek to can be from beginning of file, current position in file, or end of file.)
13 | Return current position in file
14 | Get command-line arguments
15 | Terminate 32032 execution
16 | Move data from IBM to DSI-32 (up to 65536 bytes)
17 | Move data from DSI-32 to IBM (up to 65536 bytes)
18 | Input from port on IBM
19 | Output to port on IBM
20 | Execute a software interrupt on the IBM
Memory | Contents | Queue
2000 | Service request word (SRW) indicates what 8088 wants |
2002 | 8088 completion status |
2004 | Request word indicates what 32032 wants |
2006 | 32032 completion status |
2008 | 8088 queue pointer (→2050) | Keyboard queue
200A | 32032 queue pointer (→2050) | Keyboard queue
200C | 8088 queue pointer | Video queue
200E | 32032 queue pointer | Video queue
2010 | 8088 queue pointer | Printer queue
2012 | 32032 queue pointer | Printer queue
2014 | Disk service request word (DSRW) |
2016 | Current handle |
2018 | Number of bytes to transfer |
201A | DWORD (double word) pointer to DTA (disk transfer area) |
201E | Disk completion status |
2020 | Heap |
2024 | Heap |
2028 | Stack |
202C | Stack |
2030-204F | Reserved for expansion (32 bytes) |
2050 | 64-byte keyboard buffer |
2090 | 2048-byte video buffer |
2890 | 1024-byte printer buffer |
2C90-2FFF | Reserved for expansion (880 bytes) |
3000 | Start of free memory |
Figure 2: Memory on the DSI-32 can be accessed by either the on-board 32032 or the PC's 8088, and the two processors use the memory for communication with one another.
Some of the more important locations used are shown above.
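A program that pokes at this area directly would probably name the addresses rather than scatter hexadecimal constants through the code. The sketch below does that for the buffer boundaries listed in figure 2 and derives each buffer's size from the address of the region that follows it. The constant and function names are hypothetical; only the addresses come from the figure.

```c
/* Addresses from figure 2, as seen by the 32032 (hexadecimal). */
enum comm_area {
    ADDR_COMM_BASE    = 0x2000,   /* start of the communication area */
    ADDR_DSRW         = 0x2014,   /* disk service request word */
    ADDR_KEYBOARD_BUF = 0x2050,
    ADDR_VIDEO_BUF    = 0x2090,
    ADDR_PRINTER_BUF  = 0x2890,
    ADDR_EXPANSION    = 0x2C90,
    ADDR_FREE_MEMORY  = 0x3000
};

/* Size in bytes of a region that runs from `start` up to,
   but not including, `next`. */
unsigned region_size(unsigned start, unsigned next)
{
    return next - start;
}
```

Computed this way, the keyboard buffer (2050 through 208F) is 64 bytes, the video buffer is 2048 bytes, the printer buffer is 1024 bytes, and the final reserved block is 880 bytes.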
Service Request Word (SRW) Definitions
SRW | Action
1 | Keyboard input
2 | Video output
3 | Printer output
4 | Disk operation requested
5 | Argument request (get command-line values)
7 | Task completed normally
FFFF | Abnormal task completion, see completion byte for details
Figure 3: Possible values for the service request word and their meanings. Each value corresponds to a specific request made by one processor to the other.
The NS32000 instruction set eases the task of writing efficient compilers. As an example, let's look at a simple subroutine in C that normalizes the size of a positive floating-point number to be between 0.5 and 1 and returns the corresponding scaling exponent such that
    input argument = normalized result × 2^exponent
This subroutine, in C, might be written as shown in listing A.
A more efficient means of performing this function is to examine the way in which the IEEE (Institute of Electrical and Electronics Engineers) standard specifies a floating-point number (see figure A). The number is made up of a sign bit, an 11-bit exponent, and a 52-bit mantissa. We can write figure A as a structure in C (using bit fields):
    typedef struct {
        unsigned mantissa1 : 32 ;
        unsigned mantissa2 : 20 ;
        unsigned exponent  : 11 ;
        unsigned sign      : 1 ;
    } double_precision_ieee;
When the normalize subroutine is called, this number will have been pushed onto the stack by the calling routine. All the Green Hills compilers use identical calling procedures, so the code and subroutines from each can be intermixed. Using this new structure, we could rewrite the subroutine as shown in listing B.
The Green Hills C compiler generated the 32032 machine code in listing C.
We leave it to you to check the assembler code your favorite compiler generates with this routine. With our 8088 and Digital Research C, this source code generated about 22 lines of assembler, plus a call to a library function.
Field | S | E | F
Bit: | 63 | 62...52 | 51...0
NUMBER = (-1)^S × 2^(E-1023) × 1.F (base 2)
Figure A: IEEE standard for 64-bit floating-point number.
The S bit indicates the number's sign: 0 is positive and 1 is negative.
The E or exponent field is in excess-1023 form (subtract 1023 from the value in the E field to get the actual exponent).
The F field holds the fractional portion of the number.
Listing A: A sample C program to normalize a floating-point number to between 0.5 and 1.
double normalize(y,exp)
double y;
int *exp;
{
    int scaler = 0;     /* local counter, declared inside the body */

    if (y > 1.0)
        while (y > 1.0) {
            scaler = scaler + 1;
            y = y/2.0;
        }
    else if (y < 0.5)
        while (y < 0.5) {
            scaler = scaler - 1;
            y = y*2.0;
        }
    *exp = scaler;
    return(y);
}
Listing B: A modified version of listing A, using a C data structure.
double normalize(y,exp)
double y;
int *exp;
{
    /* create a pointer to the argument passed on the stack */
    register double_precision_ieee *pointer = (double_precision_ieee *) &y ;

    /* calculate an exponent value by subtracting 1023 */
    /* (all ones) from the argument's exponent value */
    *exp = pointer -> exponent - 1023;

    /* and normalize the argument's exponent to all ones */
    pointer -> exponent = 1023;

    /* return the modified number (and of course, the exponent) */
    return(y);
}
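Bit-field layout is compiler dependent, so the same field manipulation can also be sketched with explicit shifts and masks on a 64-bit integer. This version assumes that double is the 64-bit IEEE format of figure A and ignores zero, denormalized numbers, infinities, and NaNs; the name normalize_bits is hypothetical. Like listing B, it forces the exponent field to 1023, so the returned value lies between 1 and 2.

```c
#include <stdint.h>
#include <string.h>

#define EXPONENT_SHIFT 52                     /* E field starts at bit 52 */
#define EXPONENT_MASK  ((uint64_t)0x7FF)      /* 11-bit field */
#define EXPONENT_BIAS  1023                   /* excess-1023 form */

double normalize_bits(double y, int *exp)
{
    uint64_t bits;

    /* copy the raw 64 bits of the argument into an integer */
    memcpy(&bits, &y, sizeof bits);

    /* extract the E field and remove the bias */
    *exp = (int)((bits >> EXPONENT_SHIFT) & EXPONENT_MASK) - EXPONENT_BIAS;

    /* clear the E field, then insert the biased value for 2^0 */
    bits &= ~(EXPONENT_MASK << EXPONENT_SHIFT);
    bits |= (uint64_t)EXPONENT_BIAS << EXPONENT_SHIFT;

    /* copy the modified bits back into the floating-point value */
    memcpy(&y, &bits, sizeof y);
    return y;
}
```

For an input of 8.0 this returns 1.0 with an exponent of 3, and for 0.75 it returns 1.5 with an exponent of -1.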
Listing C: 32032 machine code generated by the Green Hills C compiler for the program shown in listing B.
        .module normalize.c
        .program
_normalize:
        addrd   8(sp),r1        ;first line of C code done
        extsd   6(r1),r0,4,11   ;EXTract Short Double
                                ;from bit one, length of 11
                                ;from the 32-bit word at (R1)+6
                                ;put them into bits 0-10 of R0
        subd    1023,r0         ;next line of C code
        movd    r0,0(16(sp))    ;put it onto the stack for return
        inssd   6(r1),4,11      ;INSert Short Double
                                ;syntax similar to above extsd
                                ;except that an immediate field
                                ;inserted
        movl    8(sp),f0        ;get Y to F0, the return code
                                ;for the calling program
        rxp     0
;_pointer r1 local
;_y 8(sp) local
;_exp 16(sp) local
        .endseg
Early in 1984, Definicon Systems Inc. developed an advanced algorithm for spectral decomposition and performed most of the development work on a VAX-11/780. When that became too expensive, Definicon turned to an HP9000 32-bit super-microcomputer. After a few months Definicon realized that owning an IBM PC clone would be much cheaper than leasing the HP9000.
Before deciding to go with the IBM PC architecture, Definicon performed benchmarks on a range of machines, from the VAX minicomputer, through 68000 UNIX-based machines, to PCs. The benchmark programs included the prime-number sieve, as well as a program devised specifically to test the arithmetic processing unit's speed. The early results indicated that an IBM PC clone was capable of providing about the same performance as the more expensive machines, so Definicon bought an Eagle Turbo, which had about twice the measured speed of a basic IBM PC XT.
When Definicon converted the algorithms to Microsoft FORTRAN, they were unpleasantly surprised to find that the benchmarking had been misleading. The algorithms took about 90 minutes to run on the 8086 - about 5 times longer than the benchmarks had indicated. It took Definicon several months to find the problem.
The prime-number benchmark - the popular "Sieve of Eratosthenes" (see "Eratosthenes Revisited: Once More through the Sieve" by Jim Gilbreath and Gary Gilbreath, January 1983 BYTE, page 283) - initializes a Boolean (byte) array to logical TRUE then blanks out the nonprime numbers one by one, leaving only the primes unmarked.
For several reasons, the Sieve of Eratosthenes turns out to be a particularly bad indicator of advanced processor performance. For one thing, the usual array size is 8191 numbers, a value that exercises only the 16-bit arithmetic capabilities of the CPU. In addition, the array is Boolean (a byte array), which means that 32-bit processors usually discard 3 of the 4 bytes they fetch at each memory access.
The true measure of CPU performance, however, can be glimpsed when the array grows. When we try to find the primes in the first 40,000 numbers, for instance, a different picture emerges. The IBM PC AT (and the XT) drops to about 1/8 the performance of the 32- and 64-bit machines when 32-bit arithmetic has to be used (numbers are larger than 32,767).
A fundamental limit of the segmented architecture (found in the 8088/8086/80286 CPUs of the IBM PCs) is reached when the array extends to 65,536 numbers. Definicon could not find a compiler for the 8088/8086 that could deal with arrays with more than 2^16 elements.
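To see where the index width starts to matter, consider a plain sieve whose limit is a parameter. This is a sketch, not the published BYTE benchmark (which sieves odd numbers only); the name count_primes_below is hypothetical. Indices are 32 bits wide, and the starting point i*i is computed in 64 bits so that it cannot overflow for any 32-bit limit.

```c
#include <stdint.h>
#include <stdlib.h>

/* Count the primes below `limit` with a byte-per-number sieve.
   Returns -1 if the flag array cannot be allocated. */
long count_primes_below(uint32_t limit)
{
    unsigned char *composite;
    uint32_t i;
    uint64_t j;
    long count = 0;

    if (limit < 2)
        return 0;
    composite = calloc(limit, 1);       /* all flags start as "prime" */
    if (composite == NULL)
        return -1;

    for (i = 2; i < limit; i++) {
        if (composite[i])
            continue;
        count++;
        /* strike out multiples of i, starting at i*i */
        for (j = (uint64_t)i * i; j < limit; j += i)
            composite[j] = 1;
    }
    free(composite);
    return count;
}
```

With a limit of 8191 every index fits in a signed 16-bit integer. With a limit of 40,000 or 65,536 the indices exceed 32,767, and a 65,536-byte flag array no longer fits within a single 64K-byte segment alongside anything else.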
Another major limitation of the segmented architecture, however, is not shown by the Sieve. The 8088/8086 and 80286 have a data space that only spans 64K bytes at a time. To handle data structures larger than 64K bytes, these CPUs must employ lengthy tests whenever a data byte is fetched. These tests check to see if the byte is from the current data segment and, if not, must switch the processor to the needed segment.
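The bookkeeping behind those tests can be illustrated for the 8088/8086, where a physical address is formed from a 16-bit segment and a 16-bit offset. The sketch below shows the address calculation and the renormalization needed to step a pointer across a 64K-byte boundary. The names far_ptr, physical_address, and far_advance are hypothetical, and the 80286's protected mode, which forms addresses differently, is not modeled.

```c
#include <stdint.h>

/* A segment:offset pair as used by the 8088/8086. */
struct far_ptr {
    uint16_t segment;
    uint16_t offset;
};

/* Physical address = segment * 16 + offset, wrapped to 20 bits. */
uint32_t physical_address(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
}

/* Advance a pointer by `bytes`, renormalizing so that the offset is
   always 0 through 15. A compiler has to emit code like this (or an
   equivalent range test) wherever a data structure may cross a
   64K-byte boundary. */
struct far_ptr far_advance(struct far_ptr p, uint32_t bytes)
{
    uint32_t linear = (physical_address(p.segment, p.offset) + bytes) & 0xFFFFFu;
    struct far_ptr result;

    result.segment = (uint16_t)(linear >> 4);
    result.offset  = (uint16_t)(linear & 0xFu);
    return result;
}
```

Stepping one byte past segment 1000, offset FFFF gives segment 2000, offset 0. On a processor with a flat 32-bit address, the same step is a single add.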
Although Definicon was interested primarily in scientific applications, the performance conclusions generally apply to many business software applications. In particular, spreadsheets and database managers are slowed considerably by the 64K segmentation overhead.
The 32032 architecture avoids the segmentation delays. Every module of code written for the 32032 has a module table. Each module table has four entries: static base, link table base, code base, and one reserved entry. The 32032 supports a "call external procedure" mechanism that uses the module table for a fast context change. A routine may also access an external variable if it knows both the module in which the variable is declared and the offset (to the variable location) at which the required data is stored.
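The four-entry module table and the external-variable lookup can be pictured as a C structure. This is a sketch only: the structure and function names are hypothetical, each entry is assumed to be a 32-bit value, and the address is formed by adding the offset to the module's static base, which is one plausible reading of the description above.

```c
#include <stdint.h>

/* One module table entry: four 32-bit values, in the order given in the text. */
struct module_table {
    uint32_t static_base;   /* start of the module's static data */
    uint32_t link_base;     /* start of the module's link table */
    uint32_t code_base;     /* start of the module's code */
    uint32_t reserved;
};

/* Address of an external variable, given the module that declares it
   and the variable's offset within that module's static data. */
uint32_t external_variable_address(const struct module_table *module,
                                   uint32_t offset)
{
    return module->static_base + offset;
}
```

Because every address here is a full 32-bit quantity, no segment test or segment switch is involved in reaching the variable.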