Next: 7 Planned Features
Up: Blocking Techniques for HPF
Previous: 5 HPF Blocking with
Subsections
This Section shows some example programs that benefit from
blocking techniques.
The following algorithm implements a matrix multiplication
for square matrices. In the three-dimensional loop nest
the array A is traversed times. If the array
A does not fit in the cache, a high rate of cache
misses can be observed. Though prefetch techniques can be
applied, the performance is not very high.
double precision, dimension (N, N) :: A, B, C
!hpf$ distribute A(*,block) ! one-dimensional blocking
!hpf$ distribute A(block,block) ! two-dimensional blocking
integer I, J, K
C = 0.0d0
!hpf$ parallel, reduction(C)
do J = 1, N
!hpf$ independent, on home (A(:,K))
do K = 1, N
!hpf$ independent, on home (A(I,K))
do I = 1, N
C(I,J) = C(I,J) + A(I,K)*B(K,J)
end do
end do
end do
By the block distribution of the array A it will be guaranteed
that always one block of array A can fit in the cache. This
results in much better performance.
Table 1:
One-dimensional HPF blocking of matrix multiplication ().
|
WALL_TIME |
MFLOPS |
LD_INS |
L1_MISS |
L2_MISS |
Bus |
NB |
ms |
|
M |
M |
M |
M / s |
ori |
3478.6 |
336.73 |
1174.12 |
223.54 |
24.93 |
42.47 |
1 |
3484.0 |
336.21 |
1174.12 |
223.54 |
24.89 |
42.38 |
16 |
1558.2 |
751.85 |
1178.72 |
224.62 |
5.71 |
19.90 |
24 |
1196.0 |
979.69 |
1180.68 |
224.66 |
2.41 |
9.89 |
32 |
1197.5 |
978.53 |
1182.64 |
224.09 |
2.68 |
10.53 |
64 |
1374.5 |
852.82 |
1190.46 |
224.92 |
4.94 |
16.20 |
|
Table 2:
Two-dimensional HPF blocking of matrix multiplication ().
|
WALL_TIME |
MFLOPS |
LD_INS |
L1_MISS |
L2_MISS |
Bus |
NB |
ms |
|
M |
M |
M |
M / s |
ori |
3478.6 |
336.73 |
1174.12 |
223.54 |
24.93 |
42.47 |
1x1 |
3479.3 |
336.66 |
1175.87 |
223.63 |
24.97 |
42.50 |
24x1 |
1476.7 |
793.88 |
1199.63 |
158.10 |
4.12 |
14.89 |
32x1 |
1149.5 |
1020.13 |
1203.55 |
157.14 |
1.83 |
8.95 |
1x32 |
1203.0 |
974.74 |
1203.55 |
157.46 |
2.24 |
10.59 |
8x8 |
1216.8 |
964.73 |
1235.73 |
161.02 |
1.68 |
8.53 |
8x4 |
1086.8 |
1078.90 |
1203.55 |
156.79 |
1.39 |
7.19 |
|
Gauss algorithm:
integer, parameter :: N = 1000
double precision :: A(N,N), B(N)
!hpf$ distribute (*,block) :: A
!hpf$ distribute (block) :: B
...
!hpf$ parallel
do I = 1, N-1
Q = 1.0d0/A(I,I)
!hpf$ independent
do J = I+1, N
B(J)=B(J)-B(I)*(A(J,I)*Q)
do K=I+1,N
A(K,J)=A(K,J)-(A(K,I)*Q)*A(I,J)
end do
enddo
end do
Table 3:
HPF blocking of the GAUSS algorithm.
|
WALL_TIME |
MFLOPS |
LD_INS |
L1_MISS |
L2_MISS |
Bus |
NB |
ms |
|
M |
M |
M |
M / s |
ori |
3535.0 |
359.01 |
956.55 |
117.65 |
36.46 |
46.44 |
1 |
3677.0 |
345.81 |
957.24 |
117.00 |
36.47 |
44.64 |
8 |
2904.7 |
473.81 |
959.28 |
116.53 |
30.10 |
46.50 |
16 |
1446.7 |
879.14 |
961.55 |
115.97 |
9.87 |
30.71 |
32 |
829.6 |
1533.36 |
966.08 |
116.42 |
1.10 |
8.13 |
40 |
796.2 |
1597.80 |
968.35 |
116.17 |
0.56 |
5.97 |
48 |
805.7 |
1579.18 |
970.62 |
116.25 |
0.60 |
6.30 |
64 |
840.1 |
1514.76 |
975.15 |
116.94 |
0.76 |
7.75 |
128 |
960.1 |
1326.52 |
993.29 |
119.32 |
1.38 |
12.46 |
256 |
1200.4 |
1062.78 |
1029.58 |
123.08 |
2.66 |
19.14 |
512 |
1698.4 |
753.55 |
1102.14 |
133.30 |
5.20 |
26.41 |
1000 |
2627.9 |
414.91 |
1240.46 |
141.54 |
10.31 |
33.86 |
int |
2186.7 |
581.50 |
957.02 |
118.66 |
10.08 |
39.71 |
|
logical, dimension (N) :: A
integer :: S
!hpf$ distribute A (block)
...
S = 0
!hpf$ parallel, reduction(S) begin
A = .true.
A(1) = .false.
do K = 2, N1
if (A(K)) then
where (A(K*K:N:K)) A(K*K:N:K) = .false.
end if
end do
!hpf$ independent
do I = 1, N
if (A(I)) S = S + 1
end do
!hpf$ end parallel
Table 4:
Sieving prime numbers ().
|
WALL_TIME |
LD_INS |
L1_MISS |
L2_MISS |
Bus |
NB |
ms |
M |
M |
M |
M / s |
ori |
10996.8 |
417.08 |
193.89 |
137.95 |
50.37 |
1 |
10979.8 |
327.51 |
193.46 |
138.32 |
50.46 |
700 |
6194.3 |
602.72 |
210.17 |
50.46 |
32.85 |
800 |
3057.8 |
642.07 |
212.58 |
7.06 |
10.54 |
900 |
2794.2 |
681.43 |
214.75 |
3.67 |
6.78 |
1000 |
2798.7 |
720.79 |
216.87 |
3.23 |
7.16 |
1300 |
2806.7 |
838.88 |
223.25 |
2.82 |
5.22 |
1500 |
2879.4 |
917.61 |
227.40 |
2.60 |
4.50 |
2000 |
2952.2 |
1114.42 |
234.82 |
2.09 |
3.56 |
2500 |
3187.8 |
1311.23 |
245.38 |
1.68 |
2.78 |
|
Next: 7 Planned Features
Up: Blocking Techniques for HPF
Previous: 5 HPF Blocking with
Thomas Brandes
2004-03-18