GIER/Turbo GIER ALGOL 4
In the article The Design of the GIER ALGOL Compiler, Part II Peter Naur writes the following:
... For these reasons we wish to base our assessment of the performance of the system on analyses of the time spent on the various language functions during the execution of realistic, practical algorithms. By this approach we will be able to detect the bottlenecks of the execution. Analyses of this kind have been made for algorithms for inverting matrices, for finding eigenvalues of symmetric matrices, and for definite integration by Simpson's rule. These analyses very definitely point to two bottlenecks: (1) subscripting, and (2) transfer of control to a segment which is already present in the core store. The relative importance of these two items varies greatly, not only with the program, but also with the manner in which the program happens to be segmented. However, as a rough average it appears that these two items together account for well over half the execution time of many realistic programs. As a further conclusion it may be stated that such programs might be speeded up by a factor of two or more if two or three special machine instructions designed to take care of these two problems were included in the machine.
I've tried to take a look at point no. 2, transfer of control to a segment already in core. Not on the real GIER machine, but on the simulator. The idea is to add an extra instruction to speed up the transfer.
Transfer of control
The segmentation of the compiled GIER ALGOL program works like this:
- The compiler delivers the compiled program on the drum in tracks of 40 words.
- Each track begins with the constants needed for this segment, followed by the generated instructions.
- The first track is brought into core when the execution begins, and control is passed to the first instruction in the segment.
- The last instruction in the segment contains a call into the Running System (RS, the part of the GIER ALGOL system resident in core). RS looks up the next segment number in a table, and if it is already in core, transfers execution to the segment. The segment is loaded from the drum if it is not present.
- RS keeps track of the segments in core when the stack grows and releases the program segments as needed.
- A priority system is used to release segments from core when new segments are brought in.
We take a look on what happens when execution is transferred from on segment to another segment already in core:
The last word of the segment contains a subroutine call to the label c1 in RS:
hs c1, xxx
The right halfword of the cell contains the number of the first executable instruction in the next segment as the address. The word is f-marked (bit 41) if control is to be passed to the right halfword of the cell.
The sequence c1 in RS contains the following:
c1: gr b21 , arn s ; save R:= R; gt b9 IRC ; rel addr:= set RC (relpart(R)); arn(b1) D 1 ; Raddr:= track:= track + 1; d19h:hv (b7) , vy[*errunit]; _g_o_t_o table search [top table]
The first word saves the R register and fetches the calling hs instruction. The second word saves the relative address in b9. The third word increases the track number in b1 and transfers it to the address part of the R register. Finally, control is transferred to the address in b7 which points to the start of the table of segments in core.
The track table is a series of instructions like this:
ca 0 , hh 138 ca 1012 , hh 97 ca 1011 , hh 56 ca 1010 , hh 15
The left instruction compares the address of the instruction with the number in the address part of the R register. If they differ, the right instruction is skipped.
If the segment is not found in the table execution continues with code that loads the segment into core.
So if the address part of R contains 1011, control is passed to the right hand instruction in word 56. The segment 1011 starts at word 57. The right hand instruction in cell 56 contains a hs call of a10:
a10: d7=a10-2[used by a5] ; b9: gs b11 , ps s _-_1 ; track is in: curr place:= s; a11: [rel addr] ; set exit addr: s:= s + rel addr; b10: arn_-_4_9_2[priority] D 1 ; set priority: priority:= Raddr:= priority + 1; c65: ; b11: ga _-_1[curr place] VNO ; _i_f priority = 512 _t_h_e_n arn b2 , hv a6 ; _b_e_g_i_n Raddr:= top place; _g_o_t_o clean up _e_n_d; arn b21 ; R:= save R; hh s1 LRC ; _i_f go _t_h_e_n _g_o_ _t_o _i_f LRB _t_h_e_n right s1 hv s1 LRA ; _e_l_s_e left s1;
Note that b9 contains the offset into the segment where execution should start. The code increases the priority of the segment with one (cleans up the priority list if 512 is reached) and transfers control to the left or right instruction depending on the flag RB which is set if the hs c1 was f-marked.
If the core store contains many segments the list of ca's will be long, and it takes time to go through these instructions.
Implementing the PC instruction
In order to speed up the process of going through the ca instructions I've tried to implement a new instruction on the GIER machine, the pc instruction.
This instruction should be called with the address of the first ca instruction in the track table. The pc instruction fetches the words pointed by the address one by one and compares the address with the address in R. If a match is found, control is transferred to the right hand instruction in the address found in the right hand side of the cell.
The list is exhausted when a word is loaded that does not have bit 40 set, as the first instruction after the ca table is a full word instruction.
I've used the microcode of the ca instruction as a template, you can find this on page 120 in Teknisk beskrivelse af Gier, bind 1.
The microcode for the pc instruction looks like this:
MA | Gs | Gm | FL | Betingelse | nr. | Hop | Kommentar |
---|---|---|---|---|---|---|---|
1 | AC00-39.H40-41 sub |
H00-39 | 2 | H:=R; | |||
2 | AD1 | OT | 3 | r1:=r2; | |||
3 | OT | AD1 | 4 | loop: r2:=r1; | |||
4 | step | læs0-41 | 5 | LI:=ferrit[r2]; | |||
5 | step | 6 | |||||
6 | H00-39 0h Mode1 |
AC00-39 step |
¬LI40 | 1 | if -,LI[40] then begin R:=H; h:=false; Mode1 end; | ||
6 | LI0-9 o10-41 |
MD00-9 MD10-39 step |
LI40 | 7 | if LI[40] then O:=LI[0-9]; | ||
7 | Add00-19 | AD2 | 8 | s2:=MD+H; | |||
8 | AD2 o10-41 |
AC00-9 AC10-39 |
9 | R:=s2; | |||
9 | Tæl i OT | AC≠0 | 3 | if R≠0 then begin r1:=r1+1; goto loop end; | |||
9 | LI10-19 | TD | AC=0 | 10 | if R=0 then TD:=LI[10-19]; | ||
10 | TD skrå Mode 1 1h |
OT step |
1 | r1:=TD; h:=true; Mode1; |
I've tried to follow the conventions for microcode listed on page 50 in Teknisk beskrivelse af Gier, bind 1.
The GIER ALGOL 4 compiler needs to be patched in order to use the new instruction. The tape T1, L1 (26) 20.07.70.flx is modified like this:
flx2a <"T1, L1 (26) 20.07.70.flx" | tac | sed '0,/_s/{//d;}' | tac | \ sed -e '/_i redefine/{ N; s/_s/e14=117,e27=1/g }' | \ sed -e 's/hv(b7)/pc(b7)/g' | \ sed -e 's/c5:hv_d_2_-_3,arnb5/c5:pc_d_2_-_3,arnb5/g' >BUILDga4drumbufturbo.asc
The instruction hv(b7) occurs twice.
Testing Turbo GIER ALGOL 4
I've tried a series of GIER ALGOL 4 programs on four different GIER machines, two with buffer and two without buffer, and two with a classic GIER ALGOL 4 compiler and two with a turbo GIER ALGOL 4 compiler.
The results are listed in the table below. The speedup, in percent, is listed as well. You can click on the .asc file to see a listing of the program as a PDF file. Note that some of the lines are truncated at the margin.
Program | Function | Time, sec. GA4 Classic |
Time, sec. GA4 Turbo |
% | Notes |
---|---|---|---|---|---|
demon5tn.asc Calculation of large numbers Written by Jørgen Kjær with improvements by Thorkil Naur |
sqrt(r) | 5331.8 | 4923.3 | 7.7 | 380 decimals, buffer GIER no index check |
sqrt2(r) | 1247.6 | 1077.9 | 13.6 | ||
sqrt3(r) | 389.2 | 365.0 | 6.2 | ||
sqrt(B) | 388.1 | 363.9 | 6.2 | ||
sqrt(r) | 8195.6 | 7892.0 | 3.7 | 380 decimals, no buffer GIER no index check | |
sqrt2(r) | 2147.2 | 2008.1 | 6.5 | ||
sqrt3(r) | 379.9 | 364.9 | 3.9 | ||
sqrt(B) | 377.6 | 361.5 | 4.3 | ||
enigma1.asc Solve Geocache GC84BA2 |
9439.92 | 8811.91 | 6.7 | No buffer GIER | |
9583.25 | 8912.39 | 7.0 | Buffer GIER | ||
inv1.asc Matrix inversion Topsøe INVERT2 algorithm |
N=61 Buffer |
603.71 | 599.31 | 0.7 | No index check |
1407.11 | 1250.24 | 11.1 | Index check | ||
N=21 No buffer |
36.59 | 36.05 | 1.5 | No index check | |
67.66 | 64.94 | 4.0 | Index check | ||
inv1ga3.asc inv1.asc run on GIER ALGOL III |
N=21 | GA3: 60.49 | |||
lorenz2.asc Solve Geocache GC7J6KQ |
12949.57 | 12120.39 | 6.4 | No buffer | |
3397.18 | 3077.07 | 9.4 | Buffer | ||
lorenz3.asc Solve Geocache GC7J6KQ |
2402.93 | 2295.14 | 4.5 | No buffer | |
2427.87 | 2274.49 | 6.3 | Buffer | ||
nqueen18.asc | N=12 | 13968.89 | 12368.89 | 11.5 | No buffer |
14069.38 | 12469.37 | 11.4 | Buffer | ||
pc1.asc Linear equations. Topsøe LEQ1 algorithm ("Det Gauss" without determinant calculation). end for added on central loops. |
N=20 | 14.112 | 14.004 | 0.8 | No buffer |
N=20 | 12.920 | 12.783 | 1.1 | Buffer | |
N=60 | 262.368 | 261.448 | 0.4 | Buffer | |
pc1b.asc Like pc1.asc, but without first end for |
N=60 | 408.109 | 375.170 | 8.1 | Buffer |
pc3.asc a:=sin(0.1)+exp(0.1)+sqrt(0.1)+arctan(0.1) |
N=10000 | 184.573 | 180.017 | 2.5 | No buffer |
184.588 | 180.031 | 2.5 | Buffer | ||
pc4.asc a:=sqrt(0) |
N=30000 | 52.809 | 50.206 | 4.9 | No buffer |
52.822 | 50.218 | 4.9 | Buffer | ||
pc5.asc Project Euler problem 61 |
No index check | 7938.42 | 7036.65 | 11.4 | No buffer |
6655.42 | 6091.44 | 8.5 | Buffer | ||
Index check | 7893.65 | 7045.71 | 10.7 | No buffer | |
8032.21 | 7153.58 | 10.9 | Buffer | ||
pc6.asc Project Euler problem 112 |
No end for | 30380.22 | 28390.72 | 6.5 | No buffer |
30569.63 | 28580.12 | 6.5 | Buffer | ||
With end for | 25329.12 | 24158.28 | 4.6 | No buffer | |
25518.52 | 24347.68 | 4.6 | Buffer | ||
pc7.asc Project Euler problem 12 |
38157.34 | 38153.31 | 0.01 | No buffer | |
38154.11 | 38150.08 | 0.01 | Buffer | ||
pc8.asc N a:=b in loop No buffer |
N=640 | 659.797 | 612.932 | 7.1 | |
N=680 | 689.896 | 642.808 | 6.8 | ||
N=720 | 729.221 | 679.751 | 6.8 | ||
N=760 | 5649.694 | 5649.645 | 0.001 | ||
N=800 | 5908.468 | 5895.573 | 0.2 | ||
pentomino4.asc Pentomino from DR program |
428386 | 408163 | 4.7 | No buffer | |
280782 | 251104 | 10.6 | Buffer | ||
sudoku7.asc | tracks transferred: 972876 |
24794.2 | 24784.9 | 0.04 | No buffer |
tracks transferred: 6255 |
3407.5 | 3097.7 | 9.1 | Buffer | |
sudokus7.asc SLIP version of sudoku7.asc |
1122.83 |