Omni extensions for Cluster-enabled OpenMP/SCASH

Performance tuning for Omni/SCASH

In SDSMs, page misses are very expensive. To tune the performance of Omni/SCASH, use "threadprivate" data as much as possible. Threadprivate data is placed in local memory, so it can be referenced without any overhead.
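
For example, a per-thread scratch array can be declared threadprivate as in the sketch below (the array name and size are illustrative):

double work[1024];
#pragma omp threadprivate(work)
/* Each thread's copy of work is placed in local memory, so references
   to it inside a parallel region cause no SCASH page transfers. */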

The next keys are data mapping and loop scheduling. In SDSMs, the home node allocation of pages affects performance because the cost of consistency management is large compared to hardware NUMA systems. In SCASH, a reference to a page whose home is a remote node causes a page transfer over the network. When the home node of a page is different from the current node, the modified parts of the page must be computed and transferred at barrier points to update the page on remote nodes. SCASH can deliver high performance for an OpenMP program if data and computation are placed so that the data needed by each thread is local to the processor on which that thread is running.

In Omni/SCASH, the default data mapping is "block" mapping: the home nodes of large array objects are allocated with a "block" distribution.

In OpenMP, a programmer can specify thread-parallel computation, but the OpenMP memory model assumes a single uniform memory and provides no facilities for laying out data onto specific distinct memory spaces. Nor does it provide a loop scheduling method that takes into account the data accessed by each iteration.

We have extended OpenMP with a set of directives that allow the programmer to specify the placement of data and computation in the shared address space.

The format of the Omni extension directives is as follows:
In Fortran:
   !$OMN  directive_names

In C:
   #pragma omni directive_names

Data mapping directive

The data mapping directive specifies the mapping pattern of array objects in the shared address space. Data mapping can be applied only to array data with global scope.

Syntax in Fortran:
  !$OMN mapping(map_item,...)
  !$OMN mapping(alignee:align_target)

map_item:= array_name(mapping_subscript,...)
mapping_subscript := block | cyclic | cyclic(chunk)| *

alignee := array_name | array_name(align_subscript,...)
align_target := array_name | array_name(align_subscript,...)
align_subscript :=  expr |*

Syntax in C:
  #pragma omni mapping(map_item,...)
  #pragma omni mapping(alignee:align_target)

map_item:= array_name[mapping_subscript]...
mapping_subscript := block | cyclic | cyclic(chunk)| *

alignee:=array_name | array_name[align_subscript]...
align_target := array_name | array_name[align_subscript]...
align_subscript :=  expr |*

chunk is an integer that specifies the chunk size. expr must be an expression of the form scale*identifier+offset. The align_subscript expression in the alignee must be an identifier associated with the corresponding identifier in the align_target. The asterisk (*) means that the elements in any given column are mapped to the same node. The block keyword for the second dimension means that, for any given row, the array elements are mapped onto the nodes in large blocks of approximately equal size. As a result, the array is divided into contiguous groups of columns, and the home nodes of each group are assigned to the same node. The keyword cyclic(chunk) can be used to specify a cyclic mapping.

Note that, in the current implementation, data mapping can be specified for only one of the dimensions, because our current OpenMP compiler supports only single-level parallelism. Scaling in align_target is also not supported in the current implementation.

For example, the following directive specifies a block mapping along the second dimension of the two-dimensional array A:

In Fortran:
      dimension A(100,200)
!$omn mapping(A(*,block))

In C:
double A[200][100];
#pragma omni mapping(A[block][*])
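
As a further illustration (not taken from the original examples; the array name and chunk size are arbitrary), a cyclic mapping with a chunk size of 4 distributes the mapped dimension of D across the nodes in a round-robin fashion, four elements at a time:

In Fortran:
      dimension D(100,200)
!$omn mapping(D(*,cyclic(4)))

In C:
double D[200][100];
#pragma omni mapping(D[cyclic(4)][*])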

The following example uses alignment mapping for array B.

In Fortran:
      dimension B(100,200)
!$omn mapping(B:A)

In C:
double B[200][100];
#pragma omni mapping(B:A)
Array B then has the same mapping as array A.

The next example uses alignment with an offset:

In Fortran:
      dimension C(200,200)
!$omn mapping(C(*,i):A(*,i+1))

In C:
double C[200][200];
#pragma omni mapping(C[i][*]:A[i+1][*])
The index identifiers of the alignee and the align_target must be the same. In this example, the column C[i][*] is mapped to the processor where the column A[i+1][*] is mapped, for any i.

Since consistency in SCASH is maintained on a page basis, only page-granularity consistency is supported. If the mapping granularity is finer than the page size, the mapping specification may not be effective.

If multiple mappings are specified for the same data in different files, or in different COMMON declarations in Fortran, only one of the mappings takes effect, depending on the order of the linking process.

The syntax is borrowed from High Performance Fortran (HPF). Unlike HPF, however, each processor may hold an entire copy of the array in the same shared address space. In this sense, the directive specifies a "mapping" in the memory space, not a "distribution" as in HPF.

Affinity scheduling for parallel loop directives

In addition to the data mapping directive, we have added a new loop scheduling clause, "affinity", to schedule the iterations of a loop onto the threads associated with the data mapping.

Syntax in Fortran:
  affinity_schedule_clause := schedule(affinity, affinity_schedule_target)

affinity_schedule_target := array_name | array_name(affinity_schedule_subscript,...)
affinity_schedule_subscript := expr | *

Syntax in C:
  affinity_schedule_clause := schedule(affinity, affinity_schedule_target)

affinity_schedule_target := array_name | array_name[affinity_schedule_subscript]...
affinity_schedule_subscript :=  expr |*

expr must be an expression of the form scale*identifier+offset. In the current implementation, scale is not supported in the expression. In the following code, the iterations are assigned to the processors that hold the array elements A[i][*]:
#pragma omp for schedule(affinity,A[i][*])
  for(i = 1; i < 99; i++)
   for(j = 0; j < 200; j++)
     A[i][j] = ...;
Note that, in the current implementation, mapping and loop scheduling can be specified for only one of the dimensions, because our current OpenMP compiler supports only single-level parallelism.
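
As a sketch of how the two extensions are typically combined (the array name, sizes, and computed values are illustrative), the following maps A by rows and schedules the loop over those rows with the matching affinity clause, so that each thread mainly touches pages whose home is its own node:

double A[100][200];
#pragma omni mapping(A[block][*])

void update(void)
{
  int i;
  #pragma omp parallel
  {
    #pragma omp for schedule(affinity, A[i][*])
    for (i = 0; i < 100; i++) {
      int j;
      for (j = 0; j < 200; j++)
        A[i][j] = (double)(i + j);   /* row i is updated on the node that owns it */
    }
  }
}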

Omni/SCASH shared memory allocator "ompsm_galloc"

In C programs, heap memory is usually allocated with "malloc". In Omni/SCASH, however, the standard memory allocator "malloc" allocates local memory, which cannot be shared. To allocate heap space in the SCASH shared memory, use "ompsm_galloc" instead of "malloc".

#include 
void * ompsm_galloc(int size, int map, int arg);
This function allocates size bytes of heap memory in the SCASH shared memory space and returns its address. The argument map specifies how the memory is mapped onto the processors. If the allocation fails, NULL is returned. The maximum total global heap size is determined by OMNI_SCASH_HEAP_SIZE.
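
For illustration, a call could look like the sketch below; the values passed for map and arg are placeholders, since the available mapping constants are not listed here:

double *p;
/* Sketch only: allocate 1000 doubles on the SCASH global heap.
   The two zeros are placeholders for the mapping mode and its argument. */
p = (double *)ompsm_galloc(1000 * sizeof(double), 0, 0);
if (p == NULL) {
  /* allocation failed: consider increasing OMNI_SCASH_HEAP_SIZE */
}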

Omni/SCASH Environment Variables

In addition to the Omni environment variables, the following environment variables control the Omni/SCASH memory parameters:
OMNI_SCASH_ARGS_SIZE
Specifies the size of the argument memory for each thread. The size is in bytes by default; the suffix "k" specifies kilobytes and "m" specifies megabytes. For example, "1k" means 1024 bytes. If the program prints a "shared arg stack overflow" message, set OMNI_SCASH_ARGS_SIZE to a larger value. The default for OMNI_SCASH_ARGS_SIZE is 4K bytes.
OMNI_SCASH_HEAP_SIZE
Specifies the global heap size. The size is in bytes by default; the suffix "k" specifies kilobytes and "m" specifies megabytes. For example, "1k" means 1024 bytes. A program can allocate global heap memory with ompsm_galloc. The default for OMNI_SCASH_HEAP_SIZE is 4K bytes.