SCIMA-SMP
SCIMA-SMP: on-chip memory processor architecture for SMP
We propose SCIMA-SMP (Software Controlled Integrated Memory Architecture for SMP), which aims at higher performance and better scalability on SMP machines.
Motivation
Owing to trends in semiconductor design and technology, the performance gap between processors and off-chip main memory keeps widening. The problem is more serious in SMP (Symmetric Multi-Processor) machines, because performance scalability is limited by bus utilization, which depends strongly on the amount and pattern of data transfer and therefore on the characteristics of the application. It is thus indispensable to make good use of the cache and relieve the load on the off-chip memory bus. However, a conventional cache works poorly, and processor performance degrades seriously, especially in HPC (High Performance Computing) applications, where the data set is much larger than the cache capacity and there is little temporal locality.
To address this weakness of caches, we propose SCIMA-SMP (Software Controlled Integrated Memory Architecture for SMP), which aims at higher performance and better scalability on SMP machines. SCIMA-SMP is an extension of SCIMA, a processor architecture equipped with Software Controllable on-chip Memory (SCM) that we proposed in our previous work. Because data transfer between the processor and off-chip memory is programmable through SCM, we can optimize the timing and granularity of the data transfer. Moreover, unnecessary data transfer caused by, for example, line conflicts never occurs.
Architectural overview
Figure1: Address space of SCIMA-SMP
The basic structure of each processor in SCIMA-SMP is almost the same as in conventional SCIMA. Each processor has its own cache and SCM. However, the address space must be extended to handle accesses to the SCMs of all processors. Figure1 outlines the address space of SCIMA-SMP when the number of processors is four. A part of the logical address space is still reserved for access to SCMs. This area is partitioned into several parts, each of which corresponds to one processor's SCM.
The addresses of SCMs and off-chip main memory are exclusive in the logical address space. All processors share the address space of off-chip memory, and all data in off-chip memory can be cached by each processor. A traditional coherence mechanism is supported among the caches. An SCM, on the other hand, is an individual storage resource for each processor.
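The partitioning described above can be illustrated with a small address decoder. This is only a sketch under stated assumptions: the base address `SCM_BASE`, the contiguous per-processor layout, and the names `decode_address` and `scm_address` are hypothetical and do not come from the SCIMA-SMP design. Only the processor count (four) and the 24KB SCM size are taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the source does not give concrete addresses. */
#define NUM_PROCS  4u                      /* four processors, as in Figure1 */
#define SCM_SIZE   (24u * 1024u)           /* 24KB of SCM per processor */
#define SCM_BASE   0xF0000000u             /* assumed start of the SCM window */
#define SCM_WINDOW (NUM_PROCS * SCM_SIZE)  /* one partition per processor */

typedef struct {
    int      is_scm;  /* 1: address is in the SCM window, 0: off-chip memory */
    unsigned owner;   /* processor whose SCM holds the address (if is_scm) */
    uint32_t offset;  /* byte offset inside that SCM (if is_scm) */
} scm_decode_t;

/* Build the logical address of a byte in a given processor's SCM. */
static uint32_t scm_address(unsigned owner, uint32_t offset)
{
    assert(owner < NUM_PROCS && offset < SCM_SIZE);
    return SCM_BASE + owner * SCM_SIZE + offset;
}

/* Classify a logical address as SCM (with owner and offset) or off-chip. */
static scm_decode_t decode_address(uint32_t addr)
{
    scm_decode_t d = {0, 0u, 0u};
    if (addr >= SCM_BASE && addr - SCM_BASE < SCM_WINDOW) {
        uint32_t rel = addr - SCM_BASE;
        d.is_scm = 1;
        d.owner  = rel / SCM_SIZE;
        d.offset = rel % SCM_SIZE;
    }
    return d;
}
```

Because the SCM window and off-chip memory never overlap in this model, every address decodes to exactly one of the two, which mirrors the exclusivity stated in the text.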
The data transfer between SCM and off-chip main memory is controlled by software. The user or compiler can specify the data to be transferred, the timing and granularity of the transfer, and the location on SCM. A transfer is invoked by newly introduced instructions called page-load and page-store. The word "page" in these instruction names denotes a data chunk that is the target of data movement.
These instructions can specify a large transfer granularity, which reduces the number of off-chip memory accesses for consecutive data. They also support block-stride data transfer, which packs non-consecutive data in off-chip memory and transfers it into a consecutive area of SCM. This feature helps reduce the required SCM area and off-chip bandwidth.
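To make the block-stride transfer concrete, the following is a software model of what such a transfer does to the data. It is a sketch, not the instruction definition: the function names `page_load_model` and `page_store_model`, their parameters, and the use of `memcpy` are assumptions made for illustration. The real page-load and page-store are hardware instructions whose operand format is not given here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a block-stride page-load: gather `count` blocks of
 * `block_bytes` bytes, spaced `stride_bytes` apart in off-chip memory, into
 * one consecutive area of SCM. Returns the number of bytes placed in SCM. */
static size_t page_load_model(unsigned char *scm, size_t scm_bytes,
                              const unsigned char *mem,
                              size_t block_bytes, size_t stride_bytes,
                              size_t count)
{
    assert(stride_bytes >= block_bytes);        /* blocks must not overlap */
    assert(block_bytes * count <= scm_bytes);   /* packed data must fit */
    for (size_t i = 0; i < count; i++)
        memcpy(scm + i * block_bytes, mem + i * stride_bytes, block_bytes);
    return block_bytes * count;
}

/* Hypothetical model of the matching page-store: scatter the packed SCM
 * data back to strided locations in off-chip memory. */
static size_t page_store_model(unsigned char *mem, const unsigned char *scm,
                               size_t block_bytes, size_t stride_bytes,
                               size_t count)
{
    assert(stride_bytes >= block_bytes);
    for (size_t i = 0; i < count; i++)
        memcpy(mem + i * stride_bytes, scm + i * block_bytes, block_bytes);
    return block_bytes * count;
}
```

For example, loading one column of a row-major 4 x 4 byte matrix uses a block of 1 byte, a stride of 4 bytes, and a count of 4, so only 4 bytes of SCM are occupied instead of the 16 bytes the whole matrix would need. Setting the stride equal to the block size gives the plain contiguous transfer.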
Evaluation
We evaluate the performance of SCIMA-SMP in comparison with a cache-only architecture (hereafter CACHE-only). We assume the cache and SCM sizes shown in Table1. The total on-chip memory capacity assumed in this paper is 8KB x 4 ways. Since SCIMA-SMP can reconfigure the capacities of SCM and cache, we assume that the total 32KB of on-chip memory is used as an 8KB direct-mapped cache and 24KB of SCM. The target problems are as follows.
- Matrix-Multiply
- expressed by C = A x B where A, B and C are matrices with the same size
- NPB kernel CG
- a kernel loop to compute eigenvalues of a large-scale symmetric random sparse matrix with the CG (Conjugate Gradient) method
- QCD (Quantum Chromo-Dynamics)
- simulation of the interaction between quarks and gluons in 4-D space

Table1: Cache and SCM size

| | CACHE-only | SCIMA-SMP |
|---|---|---|
| Cache size | 32KB (4way) | 8KB (1way = direct-map) |
| SCM size | 0KB | 24KB (`page`=8KB) |


Table2: Total bus traffic (4 processors, 40 cycles latency)

| | CACHE-only | SCIMA-SMP |
|---|---|---|
| Matrix-Multiply | 30.0MB | 17.3MB |
| NPB kernel CG | 8.9MB | 6.9MB |
| QCD | 25.6MB | 30.5MB |

The target programs are parallelized with Pthreads (POSIX threads), the most common API for explicit shared-address-space programming. In the future, it would be worth considering generating parallelized code suitable for SCIMA-SMP with OpenMP or other automatic parallelizing compilers. As for optimization for SCIMA-SMP, p-load and p-store instructions are inserted into the source code by hand as function calls.
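The following sketch suggests how hand-inserted transfer calls might look inside a Pthreads program. It is an illustration only: the functions `p_load` and `p_store` here are hypothetical stand-ins implemented with `memcpy`, their signatures are assumed, and the tiny matrix size, the row-cyclic work split, and the use of a local array to model a processor's private SCM are choices made for brevity, not the evaluated code.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define N        8   /* tiny matrix dimension, for illustration only */
#define NTHREADS 4   /* one thread per processor of a 4-way SMP */

static double A[N][N], B[N][N], C[N][N];   /* shared off-chip data */

/* Hypothetical stand-ins for the hand-inserted transfer calls. On real
 * hardware these would move data between off-chip memory and SCM. */
static void p_load(void *scm, const void *mem, size_t bytes)
{
    memcpy(scm, mem, bytes);
}

static void p_store(void *mem, const void *scm, size_t bytes)
{
    memcpy(mem, scm, bytes);
}

/* Each thread handles rows tid, tid + NTHREADS, ... of C = A x B. */
static void *worker(void *arg)
{
    int tid = *(int *)arg;
    double a_row[N], c_row[N];   /* models this processor's private SCM */

    for (int i = tid; i < N; i += NTHREADS) {
        p_load(a_row, A[i], sizeof a_row);      /* stage one row of A */
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a_row[k] * B[k][j];
            c_row[j] = sum;
        }
        p_store(C[i], c_row, sizeof c_row);     /* write back one row of C */
    }
    return NULL;
}

/* Launch the workers and wait for all of them to finish. */
static void parallel_matmul(void)
{
    pthread_t th[NTHREADS];
    int ids[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        ids[t] = t;
        int rc = pthread_create(&th[t], NULL, worker, &ids[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);
}
```

Each thread writes a disjoint set of rows of `C`, so no locking is needed in this sketch. A realistic version would also block `B` and choose transfer sizes to match the SCM page size.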
The three charts in Figure2 show the speed-up ratio for the three applications when the cache line size is 32B. Off-chip memory access latency is varied over 0, 10 and 40 cycles. The baseline of the y-axis is the execution cycles of the CACHE-only model with 0-cycle latency. The figures show that CACHE-only scales well if off-chip memory latency is 0 cycles. However, its scalability degrades severely as the latency increases. SCIMA-SMP, on the other hand, scales well even when the latency is 40 cycles. In CACHE-only, the performance of Matrix-Multiply with 8 processors decreases by about 48% when the latency goes from 0 to 40 cycles, while the decrease for SCIMA-SMP is only 7%.
To analyze the reason, we evaluated the total bus traffic under the condition where the number of processors is 4 and the memory access latency is 40 cycles. This is the heaviest condition from the viewpoint of bus traffic. The result is shown in Table2.
The result shows that CACHE-only generates 1.73 times the bus traffic of SCIMA-SMP in Matrix-Multiply and 1.28 times in NPB kernel CG. In this case, the block size of data transfer is optimized for each application and each model. Therefore, the extra traffic of CACHE-only can be attributed to overheads caused by line conflicts in the associative cache or capacity misses caused by unnecessary data transfer.
On the other hand, SCIMA-SMP generates 1.19 times the bus traffic of CACHE-only in QCD. We consider that this is caused by the difference in the best block size between SCIMA-SMP and CACHE-only. In QCD, additional data is required to surround the actual target data. Consequently, the amount of actual data to be calculated in a block is smaller than the size shown in Table1. In this case, the space in one block in SCIMA-SMP is one third of that in CACHE-only. However, the performance of SCIMA-SMP in QCD exceeds that of CACHE-only when the memory access latency is more than 10 cycles. This indicates that SCIMA-SMP has other benefits that outweigh this disadvantage. In addition, SCIMA-SMP will perform better with a larger SCM size, which is a more realistic assumption.
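The traffic ratios quoted above can be recomputed from the Table2 figures with a few lines of code. This is a minimal sketch: the type `traffic_t` and the function `traffic_ratio` are hypothetical names, and the inputs are the rounded values printed in Table2. For NPB kernel CG the rounded values give about 1.29 rather than the quoted 1.28, presumably because the quoted ratio was computed from unrounded measurements.

```c
#include <assert.h>

/* One row of Table2: total bus traffic in MB for each model. */
typedef struct {
    const char *name;
    double cache_only_mb;
    double scima_mb;
} traffic_t;

/* Rounded values as printed in Table2 (4 processors, 40 cycles latency). */
static const traffic_t table2[] = {
    { "Matrix-Multiply", 30.0, 17.3 },
    { "NPB kernel CG",    8.9,  6.9 },
    { "QCD",             25.6, 30.5 },
};

/* Ratio of CACHE-only traffic to SCIMA-SMP traffic. A value above 1 means
 * SCIMA-SMP moves less data; below 1 means it moves more. */
static double traffic_ratio(const traffic_t *t)
{
    assert(t->scima_mb > 0.0);
    return t->cache_only_mb / t->scima_mb;
}
```

For QCD the ratio falls below 1, and its reciprocal (30.5 / 25.6, about 1.19) is the factor by which SCIMA-SMP traffic exceeds CACHE-only traffic.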
As a result, SCIMA-SMP is shown to be very robust in SMP configurations, especially under long memory access latency and insufficient bus bandwidth. With its flexible and powerful data transfer feature for on-chip memory, it can perform both contiguous and block-stride accesses to off-chip memory with ideal granularity, reducing the overhead of bus and memory transactions. Moreover, SCIMA-SMP can avoid redundant or unnecessary off-chip data transfers, which are serious problems for hardware-controlled caches, especially on SMP.