Notes on the Use of
CAPTools-based Automatic Parallelizer using OpenMP (CAPO)


Version 1.0Beta

March, 2000

CAPO Development Team
NASA Ames Research Center
M/S T27A-2
Moffett Field, CA 94035-1000

Please send any feedback on CAPO to:
capo@nas.nasa.gov

1. General Information

CAPO (CAPTools-based Automatic Parallelizer using OpenMP) automates the insertion of compiler directives to facilitate parallel processing on shared memory parallel (SMP) machines. While CAPO is currently integrated seamlessly into CAPTools (developed at the University of Greenwich), it is developed independently at NASA Ames Research Center as one of the components of the Legacy Code Modernization (LCM) project. The current version (Version 1.0) takes serial FORTRAN programs, performs data dependence analysis, and generates either SGI's native or OpenMP directives. A graphical user interface (the Directives Browser) is designed to assist the process of directive generation. Because the OpenMP standard is widely supported, the generated OpenMP codes can potentially run on a wide range of SMP machines.

The success of CAPO relies on accurate interprocedural data dependence information, which is currently provided by CAPTools. CAPO generates compiler directives in three stages:

  1. identification of parallel loops in the outer-most level,
  2. construction and optimization of parallel regions around parallel loops, and
  3. insertion of directives with proper information on private, reduction, induction, and shared variables.
Attempts have also been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). The user is still expected to inspect the generated code before actual execution. The Directives Browser can be used for interactive inspection after stages 1 and 2.

This write-up is not a user manual. We simply hope to give beta testers some basic information for trying out the tool. Any feedback can be sent to the address above.

For more information on CAPTools, check the web site at http://captools.gre.ac.uk/.
For more information on the LCM project, check http://www.nas.nasa.gov/Groups/Tools/Projects/LCM.

For major changes in different versions of CAPO, see WhatsNew.CAPO. More documents can be accessed from the web page at http://www.nas.nasa.gov/Tools/CAPO.

2. Using CAPO

Since CAPO is integrated into CAPTools, a user needs to be familiar with CAPTools. For the installation and use of CAPTools, please refer to the web site at http://captools.gre.ac.uk/. A test license may be obtained from the web site or by sending email to captools@gre.ac.uk. We assume that the user already has CAPTools properly installed. This write-up discusses only the information relevant to the use of CAPO.

The executable of CAPO in this distribution is in

      captool/bin/{machine}/capo
where {machine} is sgi for SGI workstations and sun for SUN workstations.

2.1 Prepare serial FORTRAN codes

CAPO currently works on FORTRAN 77 codes. Data dependence analysis is performed on the whole program. A user can either create a single file that contains all the subroutines or provide a .list file that lists all the FORTRAN files in the program. Any unresolved symbols (except for intrinsic functions) can be provided with dummy routines. For example, if the FORTRAN program calls C subroutines, dummy FORTRAN routines could be supplied to emulate the C functions, even though these dummy routines may be deleted later from the generated parallel code. This was a requirement of CAPTools prior to Version 2.1Beta-010. The latest CAPTools provides interfaces to the dummy routines automatically.

2.2 Make dependence analysis

Before inserting any directives, it is necessary to perform dependence analysis on the serial code. Either all the .f files or a .list file describing the location of all the relevant files must first be loaded into CAPTools. Then, user knowledge may be entered, typically for variables in the READ statements, as needed before performing the dependence analysis. Depending on the program size and the thoroughness of the analysis specified, this process can take hours or days to complete. Once the analysis is finished, the user should save the result to a database before proceeding further.

2.3 Inspect directive insertion

An important step in using CAPO is inspecting the dependences produced by CAPTools. The information can be overwhelming, but at a minimum one should go through all the serial loops identified by CAPTools in the loop browser. Quite often, a dependence that causes a loop to be serialized is due to insufficient knowledge of value limits for some variables. The user can use the dependence browser to remove unnecessary dependences. One should also check for Input/Output dependences, as they could cause some privatizable variables not to be listed.

A better approach for inspecting the loops is to use the Directives Browser implemented in CAPO (see Section 3 for details). The browser can be activated from the "View/Directives" menu and is designed to display information that is directly relevant to directive insertion. For instance, the browser provides more interactive information on the reasons for loops to be parallel or serial. The user can concentrate on loops that are indicated as serial and manipulate the dependence graph if needed. This is an iterative process. It is always a good idea to save the result to a database whenever a change is made before directives are inserted.

2.4 Generate parallel code with directives

Once the dependence analysis is completed and the loop information is inspected, directives can be inserted automatically by selecting the "Save OpenMP Directive Code" option under the File menu. The type of directives is controlled by the CAPO parameters described in Section 4, which are also selectable from the Setting box in the Directives Browser. One can elect to use the default setup, which is to produce OpenMP directives with the full range of analysis. Steps in the generation of directives are logged to a log file, by default to "code-output.log". The contents of the log file are described in Section 5.

2.5 Inspect the generated codes and the log information

It is very important to inspect the generated parallel code, together with the log information. In particular, one should look for any shared and private variables that may be listed incorrectly. Warning messages in the last section (PASS 3) of the log file can indicate places where potential problems might exist. Of course, one can use other tools (such as ASSURE from Kuck & Associates) to check for problems.

2.6 Compile and run the parallel code

On the SGI Origin2000, use the "-mp" option to compile the directive parallel code. To run the code (on 8 CPUs, for example), do
      % setenv OMP_NUM_THREADS 8
      % a.out (or your_program_name)

3. The Directives Browser

The Directives Browser (from the View menu) can be used to assist the inspection of directive insertion. The browser activates after CAPO finishes the directive analysis in the first two steps (loop and region analyses) and provides more interactive information on the reasons for loops to be parallel or serial. The user can concentrate on loops that are indicated as serial (totally or covered, as given below). The Dep-Graph Browser can then be used to manipulate the dependence graph.

3.1 Loop filters

Loops in the Directives Browser are classified into the following types with selectable filters and sub-filters so that the user can narrow down to a particular type:
- Totally Serial (TS)
   Loop itself is serial (with loop-carried true dependence, I/O
      statements, Exit statements, or number of iterations less than 2)
      AND not within or containing any parallel loops.
      'Exit' statements are statements that jump out of the loop,
      for example via GOTO or RETURN.
   Sub filters:
      True Recursion  - loop-carried true dependence, but containing
                        no I/O or exit statements
      I/O or Exit     - I/O and/or exit statements, and with true dependence
      No Granularity  - loop with one or fewer iterations, or use of string
                        range (:) in a possible pipeline loop

- Covered Serial (CS)
   Loop itself is serial (loop-carried true dependence, I/O or Exit
      statements) AND within or containing parallel loops.
   Sub filters:
      True Recursion  - loop-carried true dependence, containing parallel loops
      I/O or Exit     - I/O or exit statements, containing parallel loops
      Inside Parallel - inside parallel loops

- Falsely Serial (FS)
   Loop itself has no loop-carried true dependence and no exit statements
      AND not within any parallel loops, but may contain parallel loops.
   Sub filters:
      Privatization   - loop-carried anti or output dependence, no I/O 
                        statements, and/or with non-privatizable variables.
      I/O Statement   - I/O statements, without true dependence
      No Granularity  - no granularity or use of string 
                        range (:) in a possible parallel loop

- Reductions (RD)
   Loop with reductions. Symbol with '()' indicates an array.

- Pipeline (PP)
   Loop could be used as part of a parallel pipeline.

- Chosen (CS)
   Loop is a parallel loop other than reduction and pipeline loops
      AND not within other parallel loops.
   Sub filters:
      Normal          - containing no copyin/out variables
      CopyIn/Out      - with copyin/out variables
      Ordered         - with 'ordered' variables, for example, scalars
                        assigned in an IF statement and not privatizable

- Not Chosen (NC)
   Loop is parallel, but not chosen due to other chosen parallel loops.
      The loop is either inside or containing parallel loops.
   Sub filters:
      Inside Parallel - inside other parallel loops (excluding I/O loop)
      I/O Statement   - possible parallel I/O, inside or containing 
                        parallel loops
      No Granularity  - no granularity, but containing parallel loops

For all cases -
   Sub filters:
      User Defined    - as user defined loop type
   Show Parallel I/O:
      Yes - show loops with parallel I/O in the Sub filters
      No  - treat loops with I/O as serial
Usually one wants to go through the following loop types:
   - Totally Serial->True Recursion
   - Covered Serial->True Recursion
   - Falsely Serial->Privatization
   - Chosen->CopyIn/Out
and use the Why window to find out the reason for a particular loop type.

3.2 The "Why" window

The Why window (WhyDirectives browser) can be activated by clicking on the "Why..." button in the Directives Browser window once a loop is selected. The window displays information on the variables that cause a loop to receive its classification, for example serial.

The cause for a loop not to be parallel can come from several sources, for example, loop-carried TRUE/ANTI/OUTPUT dependences or non-privatizable variables (reuse of memory). If one is sure that some of these dependences are false (mostly due to lack of input information for the dependence analysis) and can be removed, the Dep-Graph browser can be used. A shortcut is provided in the Why window, where variables can be selected from the Var-List boxes and the relevant dependences can be removed by clicking the 'Remove' button. The following relevant dependences will be removed, based on the loop/variable type:

   Loop-Type        Var-List     Dependence-Type
   --------------------------------------------------------------------
   Totally Serial   True-dep     Loop-carried TRUE dependence
                    Anti-dep     Loop-carried ANTI dependence
                    Output-dep   Loop-carried OUTPUT dependence
   Covered Serial   True-dep     Loop-carried TRUE dependence
                    Anti-dep     Loop-carried ANTI dependence
                    Output-dep   Loop-carried OUTPUT dependence
   Falsely Serial   Anti-dep     Loop-carried ANTI dependence
                    Output-dep   Loop-carried OUTPUT dependence
                    In/Out-dep   TRUE dependence from outside of the loop
   Chosen Parallel  Copyin/Out   TRUE dependence from outside of the loop
Once a change to the dependence graph (either via the Dep-Graph browser or via the WhyDirectives browser) is made, be sure to save the change to the database (File-> Save Database) and re-perform the directive analysis ("Update Directives..." button).

3.3 The Setting dialog box

The Setting dialog box (from Edit-> Directives_Setting or View-> Directives-> Setting) can be used to reset parameters for CAPO as described in Section 4.

3.4 New loop type

A loop type (given in Section 5.1) can be overridden by the user with the "LoopType" dialog box activated from the WhyDirectives-> New_Type button. Currently only four types are selectable:
   Parallel    - from parallel without granularity
   Serial      - from parallel loop, including reduction
   Reduction   - from parallel loop or serial loop with loop-carried
                      true dependence
   Break       - from any other cases
Only the conversions indicated above are possible from the dialog box. Although loop types can be redefined from the user-defined loop file, use of the LoopType dialog box is safer. However, one should keep in mind that changing the loop type manually could potentially lead to incorrect results if the conversion rules above are not followed carefully.

3.5 Routine duplication browser

The RoutineDup Browser (from View-> Directives-> RoutDup) is used for browsing routines which will be duplicated to avoid nested worksharing directives. These routines are called both inside and outside parallel loops and contain parallel loops themselves. The browser will indicate those calls that are inside parallel loops and those that are outside parallel loops.

There are two selectable types of routine duplication (rdup):

   - 'Loop Usage' as the (default) type for rdup if a routine is used 
      both inside and outside parallel loop(s).
   - 'Region Usage' as the type for rdup if a routine is used inside a 
      parallel loop and inside parallel region but outside parallel loop.
The second option conforms to the OpenMP standard, in that a parallel region can be nested inside a parallel loop but not inside a parallel region.
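
The default 'Loop Usage' rule can be sketched in a few lines of Python. This is only an illustration of the condition stated above, not CAPO's implementation; the routine names and the input layout (one boolean per call site) are hypothetical.

```python
def routines_to_duplicate(routines):
    """Return the routines that the 'Loop Usage' rule would duplicate.

    routines maps a routine name to a dict with:
      "has_parallel_loops": True if the routine contains parallel loops
      "calls": one boolean per call site, True if that call is inside
               a parallel loop (hypothetical input layout)
    """
    selected = []
    for name, info in routines.items():
        calls = info["calls"]
        called_inside = any(calls)
        called_outside = any(not inside for inside in calls)
        # Duplicate only when all three conditions from the text hold.
        if info["has_parallel_loops"] and called_inside and called_outside:
            selected.append(name)
    return sorted(selected)


# Hypothetical routines, for illustration only.
demo = {
    "SOLVE": {"has_parallel_loops": True, "calls": [True, False]},
    "SETUP": {"has_parallel_loops": True, "calls": [False]},
    "NORM": {"has_parallel_loops": False, "calls": [True, False]},
}
print(routines_to_duplicate(demo))
```

Only SOLVE qualifies here: SETUP is never called inside a parallel loop, and NORM contains no parallel loops of its own.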

4. Parameters for CAPO

The following describes parameters available in Version 1.0.

4.1 General

Parameters are inputs that the user can supply to control the behavior of directive generation in CAPO. There are default settings for all the parameters (see 4.3). Parameters can be defined from a file, environment variables, or the Setting box in the Directives Browser. Values from a parameter file or environment variables supersede any defaults. Values from the parameter file supersede environment variables. Changes from the Setting box in the Directives Browser are applied last.
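
The precedence order can be illustrated with a small Python sketch. It is not part of CAPO; the function name is hypothetical, and only three of the keys from 4.3 are included to keep the example short.

```python
# Priority, lowest to highest:
#   defaults < environment variables < parameter file < Setting box

# A few defaults and key/environment-variable pairs from the table in 4.3.
DEFAULTS = {
    "loop-granularity": "6",
    "directive-type": "omp",
    "optimize-type": "o2",
}
ENV_NAMES = {
    "loop-granularity": "CAPO_PLOOP",
    "directive-type": "CAPO_TYPE",
    "optimize-type": "CAPO_OPTIMIZE",
}


def resolve_parameters(environ, file_values, setting_box_values):
    """Merge the four sources in increasing order of priority."""
    resolved = dict(DEFAULTS)
    for key, env_name in ENV_NAMES.items():
        if env_name in environ:
            resolved[key] = environ[env_name]
    resolved.update(file_values)         # the file supersedes the environment
    resolved.update(setting_box_values)  # the Setting box is applied last
    return resolved


params = resolve_parameters(
    environ={"CAPO_TYPE": "sgi", "CAPO_OPTIMIZE": "off"},
    file_values={"directive-type": "omp"},
    setting_box_values={"loop-granularity": "10"},
)
print(params)
```

In this run the file value "omp" wins over the environment value "sgi", the environment value "off" replaces the default "o2", and the Setting box value "10" replaces the default granularity.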

4.2 The parameter file

The parameter filename can be defined via environment variable CAPO_PAR. The default filename is "capo-inp.par" in the current directory. See 4.5 for an example of this file.

Format of this file:

   '#' sign           starts a comment
   'key value' pair   defines an entry
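
A reader for this format might look like the following Python sketch. It assumes, based on the sample file in 4.5, that text in parentheses after the value is an annotation and that a key with no value keeps its default; how CAPO itself treats such lines is not stated here. The function name is hypothetical.

```python
def parse_parameter_text(text):
    """Return {key: value} for each 'key value' line; '#' starts a comment."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        key = fields[0]
        # Assumption: a key followed by nothing, or only by a parenthesized
        # note, carries no value and is left at its default.
        if len(fields) < 2 or fields[1].startswith("("):
            continue
        entries[key] = fields[1]
    return entries


sample = """\
# env: CAPO_LOG
log-file            on    (off on stdout)
log-file-name             (default: codeoutput.log)
loop-granularity    6     (0 1 2 ...)
"""
print(parse_parameter_text(sample))
```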

4.3 Available keys and possible values

   ENV_VARIABLE   KEY                DEFAULT         POSSIBLE VALUES
   CAPO_PAR                          capo-inp.par
   CAPO_LOG       log-file           on              (off on stdout)
   CAPO_LOGNAME   log-file-name      codeoutput.log
   CAPO_LOGINFO   log-info           std             (min std more debug)
   CAPO_PLOOP     loop-granularity   6               (0 1 2 ...)
   CAPO_TYPE      directive-type     omp             (omp sgi sgix)
   CAPO_REGION    region-type        default         (loop bloop one join full)
   CAPO_OPTIMIZE  optimize-type      o2              (off on o2)
   CAPO_USERLOOP  user-loop-file     user-loop.par
   CAPO_DIRCLEAR  directive-clear    default-list    (off on filename)
   CAPO_TPRIV     tpriv-directive    on              (off on)
   CAPO_COMMENT   comment-type       f90             (f77 f90)
   CAPO_USEPARTI  use-parti-loop     no              (no yes)
   CAPO_ORDERED   ordered-directive  off             (off on)
   CAPO_RDUPTYPE  rdup-type          loop            (loop region)
Notes:

4.4 Parameters for debugging purpose

The following parameters are only available from the Setting box in the Directives Browser. By default, all these parameters are enabled. The Setting box can be used to disable them for debugging purposes.
   Generate-NOWAIT            - enable/disable NOWAIT directive
   Transform-Induction-Loop   - enable/disable induction loop treatment
   Handle-Array-Reduction     - enable/disable array reduction
   Remove-Old-Directives      - enable/disable removing old directives
   Apply-UserLoop-Type        - enable/disable applying userloop types
   Setup-Pipeline-Loop        - enable/disable pipeline loop

4.5 Sample parameter file

# env: CAPO_PAR
# Parameters for CAPTools-based Parallelizer with OpenMP (CAPO)
# They apply to version 1.0

# env: CAPO_LOG
# defines if log-information is wanted
log-file		on	(off on stdout)

# env: CAPO_LOGNAME
# defines log-file name when log-file = on
log-file-name			(default: codeoutput.log)

# env: CAPO_LOGINFO
# defines type of information to be logged
log-info		std	(min std more debug)

# env: CAPO_PLOOP
# defines granularity (min. no. of iters.) for parallel loops
loop-granularity	6	(0 1 2 ...)

# env: CAPO_TYPE
# defines type of directives to be produced
directive-type		omp	(omp sgi sgix)

# env: CAPO_REGION
# defines type of parallel regions to be considered
region-type		full	(loop bloop one join full)

# env: CAPO_OPTIMIZE
# defines optimization type for parallel regions
optimize-type		o2	(off on o2)

# env: CAPO_USERLOOP
# defines the file name for user-defined loop types
user-loop-file                  (default: user-loop.par)

# env: CAPO_DIRCLEAR
# defines the file name for directives to be cleared
directive-clear         Default (off on filename)

# env: CAPO_TPRIV
# switches on/off the generation of THREADPRIVATE
tpriv-directive         on      (off on)

# env: CAPO_COMMENT
# chooses a comment type for directives
comment-type         	f90     (f77 f90)

# env: CAPO_USEPARTI
# uses partitioned loops for directives
use-parti-loop       	no      (no yes)

# env: CAPO_ORDERED
# creates ORDERED code section
ordered-directive       off     (off on)

# env: CAPO_RDUPTYPE
# defines routine duplication type
rdup-type               loop    (loop region)

5. Information Generated in the Log File

By default, the process of automatic insertion of directives is logged to the log file "code-output.log". Information in this file should be examined after directives are added. There are three main sections in the log file, as outlined in the following subsections. Depending on the log-info type described in Section 4, different levels of detail may be logged. In general, the log-info type controls:

min - only a minimum amount of information, such as WARNING and INFO messages
std - information from min, plus a summary for each routine and each region
more - information from std, plus more detailed results for each loop and each region
debug - information from more, plus additional debug information that is probably too much for an ordinary user.

In the case of "more" and "debug", additional labels (region# and loop#) are added as comments for parallel loops in the generated parallel code. Regions and loops are labeled within a given routine, sequentially.

5.1 Classification of loops

The first section lists the analysis of loops in all routines, based on the dependence information. Within a given routine, a loop is labeled with its sequence number, the first-level group number, and the loop-nesting level. Loops are classified as parallel, serial, or possible pipeline. A parallel loop is further tested for granularity, and the log indicates whether a parallel directive is to be added, provided the loop is not nested inside another parallel loop. For a serial loop, the reason for serialization is given, along with the first variable that causes the loop to be serialized. The causes of loop serialization include loop-carried true dependence, an I/O statement inside the loop, and breaking out of the loop. A pipeline loop is a serial loop with only loop-carried true dependences and determinable dependence vectors. The basic information for loops is as follows:
Routine: ROUTINE_NAME
  Loop # (loop_variable), group #, level #: parallel/serial
       TYPE? Reason for serial...
"TYPE?" is one of the types from the loop type list:
   "REDU", "NPAR", "PAR", "IO", "LVAR", "SER", "ANTI", "PIPE",
   "BRK", "UPIPE", "PAREG", "INDU", "INPLP", "RDINP", "GRAN", "PARTI"
As an example, part of the analysis for three routines in LU is given here (with log-info set to "more").
Routine: BUTS
 Loop 1 (J), group 1, level 1: parallel, granularity - ok
        PAR-> directives to be added for the loop <1,1>
 Loop 2 (I), group 1, level 2: parallel, granularity - ok
        INPLP? no directive, loop inside a parallel loop
 Loop 3 (M), group 1, level 3: parallel, granularity - no
 Loop 4 (J), group 2, level 1: serial
        PIPE? true dependence, pipeline loop? dvector: V[0,0,-1,0]
 Loop 5 (I), group 2, level 2: serial
        PIPE? true dependence, pipeline loop? dvector: V[0,-1,0,0]
 Loop 6 (M), group 2, level 3: parallel, granularity - no
 Loop 7 (M), group 2, level 3: parallel, granularity - no
 *** Total number of loops: 7, parallel: 5, serial: 2, directive: 1
Routine: JACU
 Loop 1 (J), group 1, level 1: parallel, granularity - ok
        PAR-> directives to be added for the loop <1,1>
 Loop 2 (I), group 1, level 2: parallel, granularity - ok
        INPLP? no directive, loop inside a parallel loop
 *** Total number of loops: 2, parallel: 2, serial: 0, directive: 1
...
Routine: SSOR
 Loop 1 (I), group 1, level 1: serial
        ANTI? loop carried output or non-exact anti dependence: ELAPSED
 Loop 2 (I), group 2, level 1: serial
        ANTI? loop carried output or non-exact anti dependence: ELAPSED
 Loop 3 (ISTEP), group 3, level 1: serial
        BRK? break out of the loop or comm-call inside the loop
 Loop 4 (K), group 3, level 2: parallel, granularity - ok
        PAR-> directives to be added for the loop <2,1>
 Loop 5 (J), group 3, level 3: parallel, granularity - ok
        INPLP? no directive, loop inside a parallel loop
 Loop 6 (I), group 3, level 4: parallel, granularity - ok
        INPLP? no directive, loop inside a parallel loop
 Loop 7 (M), group 3, level 5: parallel, granularity - no
 Loop 8 (K), group 3, level 2: serial
        SER? loop carried true dependence: ELAPSED
 Loop 9 (K), group 3, level 2: serial
        SER? loop carried true dependence: ELAPSED
 Loop 10 (K), group 3, level 2: parallel, granularity - ok
        PAR-> directives to be added for the loop <2,2>
 Loop 11 (J), group 3, level 3: parallel, granularity - ok
        INPLP? no directive, loop inside a parallel loop
 Loop 12 (I), group 3, level 4: parallel, granularity - ok
        INPLP? no directive, loop inside a parallel loop
 Loop 13 (M), group 3, level 5: parallel, granularity - no
 *** Total number of loops: 13, parallel: 8, serial: 5, directive: 2

>>>> Grand total: num_routines 25, num_loops 157
           loops: parallel 145, serial 12, directive 30
The label for a parallel loop with a directive to be added (PAR->) is given as a <level,group> pair. In the case of a serial loop, only one variable is listed as the cause of serialization. For a potential pipeline loop, the dependence vector for the first related variable is given for the corresponding loop, as in the case of V[0,0,-1,0] for loop 4 (J) in routine BUTS.
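
When a log is long, the per-routine totals can be cross-checked with a short script. The Python sketch below is an illustration only: the regular expression is inferred from the BUTS excerpt above, and it assumes the classification after the colon is always "parallel" or "serial", as in that excerpt. The function and variable names are hypothetical.

```python
import re

# Pattern inferred from the sample log lines; not an official format description.
LOOP_RE = re.compile(
    r"^\s*Loop (\d+) \((\w+)\), group (\d+), level (\d+): (parallel|serial)"
)


def summarize_loops(log_lines):
    """Count loops, parallel loops, serial loops, and PAR-> directive lines."""
    counts = {"loops": 0, "parallel": 0, "serial": 0, "directive": 0}
    for line in log_lines:
        match = LOOP_RE.match(line)
        if match:
            counts["loops"] += 1
            counts[match.group(5)] += 1
        elif line.strip().startswith("PAR->"):
            counts["directive"] += 1
    return counts


BUTS_LOG = """\
Routine: BUTS
 Loop 1 (J), group 1, level 1: parallel, granularity - ok
        PAR-> directives to be added for the loop <1,1>
 Loop 2 (I), group 1, level 2: parallel, granularity - ok
        INPLP? no directive, loop inside a parallel loop
 Loop 3 (M), group 1, level 3: parallel, granularity - no
 Loop 4 (J), group 2, level 1: serial
        PIPE? true dependence, pipeline loop? dvector: V[0,0,-1,0]
 Loop 5 (I), group 2, level 2: serial
        PIPE? true dependence, pipeline loop? dvector: V[0,-1,0,0]
 Loop 6 (M), group 2, level 3: parallel, granularity - no
 Loop 7 (M), group 2, level 3: parallel, granularity - no
"""

print(summarize_loops(BUTS_LOG.splitlines()))
# {'loops': 7, 'parallel': 5, 'serial': 2, 'directive': 1}
```

The counts agree with the "*** Total number of loops" line that the log prints for BUTS.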

The user-defined loop types are applied after the loop classification. Therefore, it is the user's responsibility to ensure the correctness of user-supplied loop types.

5.2 Construction of parallel regions

This section first contains the summary from the pass-two analysis of all routines at the outer-most loop level, which decides whether directives need to be added in a routine. Routines are traversed along their calling sequences. A <yes> or <no> flag is marked for each analyzed routine to indicate whether directives are added in the routine. A routine may need to be duplicated if it is called both inside and outside a parallel loop and contains directives itself.
Routine: ROUTINE_NAME <yes/no/inploop/noploop>

<yes>      - with directives for parallel loops
<no>       - no directives
<inploop>  - routine is called inside a parallel loop
<noploop>  - routine has no parallel loop, but may contain potential 
             pipeline loops
A sample result from the analysis of NPB-LU looks like the following.
Routine: APPLU <yes>
Routine: READ_INPUT <no>
Routine: DOMAIN <no>
Routine: SETCOEFF <no>
Routine: SETBV <yes>
Routine: SETIV <yes>
Routine: ERHS <yes>
Routine: SSOR <yes>
Routine: TIMER_CLEAR <no>
Routine: JACLD <yes>
Routine: BLTS <yes>
Routine: JACU <yes>
Routine: BUTS <yes>
Routine: RHS <yes>
Routine: TIMER_START <no>
Routine: L2NORM <yes>
Routine: TIMER_STOP <no>
Routine: ELAPSED_TIME <no>
Routine: WTIME <no>
Routine: ERROR <yes>
Routine: EXACT <no>
Routine: PINTGR <yes>
Routine: VERIFY <no>
Routine: PRINT_RESULTS <no>
Routine: TIMER_READ <no>
>>> Total routines: 25, checked: 24, with directives: 13
    in/outside ploop: 0, in/with ploop: 0, no ploop: 12
    Total directive loops: 30, effective: 30, in ploop: 0
The last line of the statistics indicates how many loops can take directives, how many of them actually receive directives, and how many of them are nested inside other loops with directives.

Next, parallel regions are constructed based on the loop information. A parallel region includes at least one parallel loop or pipeline loop, possibly with basic blocks at the beginning of the loop. No nested parallel loops are considered at this point. Two neighboring regions can be joined together if no code other than comments or no-ops exists between the two regions. Individual regions are labeled sequentially within a routine. For each region, a number is included in () to indicate the end (or last) region of a joined area (of regions). For disjoint regions, the end region is the same as the region itself. Additional information included for a region is: the loops in the region and the type of the region. Regions are also summarized for a routine as "region-type-summary".

Region-type:
   one ploop    - containing exactly one parallel loop (no pipeline)
   +prev-block  - one parallel loop plus any preceding basic blocks
   sub ploop    - one or more parallel loops nested at different levels
   pipeline     - potential pipeline
   <default>    - region with joined neighbors

Region-type-summary:
   DEFAULT      - routine contains normal parallel regions
   PIPE         - routine is part of a pipeline region
   UPIPE        - routine contains potential pipeline regions
Sample outputs from the analysis of NPB-LU:
Region-in-Routine: BUTS
 region-type-summary: UPIPE
 Parallel region 1 (2): loops [1-3]
 Parallel region 2 (2): loops [4-7]
 *** Total number of regions: 2, joined regions: 1
Region-in-Routine: JACU
 region-type-summary: DEFAULT
 Parallel region 1 (1): loops [1-2] one ploop
 *** Total number of regions: 1, joined regions: 1
Region-in-Routine: SSOR
 region-type-summary: DEFAULT
 Parallel region 1 (1): loops [4-7] one ploop
 Parallel region 2 (2): loops [10-13] one ploop
 *** Total number of regions: 2, joined regions: 2
Once the initial regions are determined, routines are then checked for possible pipeline regions across routines. If such a region is identified, the pipeloop limit is checked against all other parallel loops in the same pipeline region for alignment. If a discrepancy is found, a message will be printed out as either "not the same limit" or "low-high limit swapped!". In the first case, the suggested pipeline operation may produce incorrect run-time results, and a further check of the generated code is needed. In the second case, CAPTools automatically swaps the loop limits to ensure consistency. If pipeline loops are not desirable, set the environment variable CAPO_REGION to "join".
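
One plausible reading of this limit check is sketched below in Python. The three message strings come from the log; the comparison logic itself is an assumption made for illustration, and the function name and the (start, end, step) tuple layout are hypothetical.

```python
def compare_limits(pipeloop, thisloop):
    """Compare two DO loops given as (start, end, step) tuples.

    Bounds are kept as the symbolic names written in the DO statement,
    and the step is an integer.
    """
    if pipeloop == thisloop:
        return "same limit"
    p_start, p_end, p_step = pipeloop
    t_start, t_end, t_step = thisloop
    # Same bounds traversed in the opposite direction.
    if (t_start, t_end) == (p_end, p_start) and t_step == -p_step:
        return "low-high limit swapped!"
    return "not the same limit"


# DO J=JEND,JST,-1 (BUTS) is the pipeloop in the LU analysis.
pipeloop = ("JEND", "JST", -1)
print(compare_limits(pipeloop, ("JEND", "JST", -1)))  # DO J=JEND,JST,-1 (BUTS)
print(compare_limits(pipeloop, ("JST", "JEND", 1)))   # DO J=JST,JEND,1 (JACU)
```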

For LU, routines BUTS and JACU were identified to be part of a pipeline region in routine SSOR and information was generated as follows.

Region-in-Routine: BUTS
 region-type-summary: PIPE
 pipeloop: DO J=JEND,JST,-1 (BUTS)
 thisloop: DO J=JEND,JST,-1 (BUTS)
   same limit
Region-in-Routine: JACU
 region-type-summary: PIPE
 pipeloop: DO J=JEND,JST,-1 (BUTS)
 thisloop: DO J=JST,JEND,1 (JACU)
   low-high limit swapped!
Region-in-Routine: SSOR
 region-type-summary: DEFAULT
 Parallel region 1 (1): loops [4-7] one ploop
 Parallel region 2 (2): loops [8-8] pipeline
 Parallel region 3 (3): loops [9-9] pipeline
 Parallel region 4 (4): loops [10-13] one ploop
 *** Total number of regions: 4, joined regions: 4

>>>> Grand total: routines 25, regions 34, joined regions 26
Parallel regions are further optimized by removing end-of-loop synchronization (using the 'NOWAIT' construct). Although a conservative approach is taken, careful examination of NOWAIT is still needed. For example, one should pay attention to the WARNING messages on 'EndLoop-Sync required/re-enforced'. If any problem occurs, one can always switch the optimization off (setenv CAPO_OPTIMIZE off).

For LU, this is the summary after region optimization:

>>>> Total number of syncs removed: 7, in 4 routines (13 checked)

5.3 Insertion of directives in routines

There are four functions performed in this stage:

   - clearing any old directives if CAPO_DIRCLEAR is not off (section 4.3),
   - searching threadprivate common blocks and inserting the THREADPRIVATE directive if CAPO_TPRIV is not off,
   - duplicating routines if needed, and
   - inserting region/loop-level directives.

Information resulting from these four actions is not fed back to the Directives Browser, except as directives shown in the source code. Thus, once directives are inserted, the Directives Browser should not be used to make further changes.

A threadprivate common block is one that has all its variables used as private (including copyin) in all the parallel regions of the whole program. This means that even a single instance of non-private usage of a variable can prevent the common block from becoming threadprivate. In debug mode, the causes of a common block being determined as threadprivate or shared can be examined. See Section 5.4 for details. Normally, messages are printed for identified threadprivate common blocks and the routines that contain them. An example is given here.

T_PRIV common blocks:
 -/WORK_1D/-18: SP SET_CONSTANTS EXACT_RHS INITIALIZE ADI TXINVR X_SOLVE NINVR
       Y_SOLVE PINVR Z_SOLVE LHSINIT TZETAR ADD VERIFY ERROR_NORM COMPUTE_RHS
       RHS_NORM
 -/WORK_LHS/-18: SP SET_CONSTANTS EXACT_RHS INITIALIZE ADI TXINVR X_SOLVE
       NINVR Y_SOLVE PINVR Z_SOLVE LHSINIT TZETAR ADD VERIFY ERROR_NORM
       COMPUTE_RHS RHS_NORM

>>> THREADPRIVATE directive added for 2 common blocks in 18 routines
Warnings may be printed for those common blocks that could potentially be threadprivate:
WARNING! SSOR... region 4, loop 8
	/CJAC/ Type conflict: old SHARED, new PRIV - use SHARED
It indicates that in routine SSOR all variables in common block /CJAC/ are used as private in region 4, but the common block is shared in other places. One can trace further for where the common block is shared in the debug mode.
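
The all-or-nothing rule for threadprivate common blocks can be sketched as follows in Python. This only illustrates the stated rule, not CAPO's analysis; the input layout, the region labels, and the variable names inside the blocks are hypothetical.

```python
def classify_common_blocks(usage):
    """Classify each common block as THREADPRIVATE or SHARED.

    usage maps block name -> region label -> variable name -> "PRIV" or
    "SHARED" (copyin usage counts as "PRIV").
    """
    result = {}
    for block, regions in usage.items():
        all_private = all(
            kind == "PRIV"
            for variables in regions.values()
            for kind in variables.values()
        )
        # A single shared use anywhere keeps the whole block shared.
        result[block] = "THREADPRIVATE" if all_private else "SHARED"
    return result


# Hypothetical usage data: variable names A, B, C are placeholders.
usage = {
    "/WORK_1D/": {"region 1": {"A": "PRIV"}, "region 2": {"A": "PRIV"}},
    "/CJAC/": {"region 4": {"B": "PRIV", "C": "PRIV"},
               "region 1": {"B": "SHARED"}},
}
print(classify_common_blocks(usage))
```

Here /CJAC/ stays shared because one region uses a variable of the block as shared, which mirrors the "old SHARED, new PRIV - use SHARED" situation in the warning above.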

Directives are added by annotating the call graph and using the parallel region information obtained in 5.2. The call paths are printed as the insertion progresses. Each routine is visited only once.

Routine: APPLU
Routine: APPLU->SETCOEFF
Routine: APPLU
Routine: APPLU->SETBV
Routine: APPLU
Routine: APPLU->SETIV
Routine: APPLU
Routine: APPLU->ERHS
Routine: APPLU
Routine: APPLU->SSOR
Routine: APPLU->SSOR->RHS
Routine: APPLU->SSOR->RHS->TIMER_START
Routine: APPLU->SSOR->RHS->TIMER_START->ELAPSED_TIME
Routine: APPLU->SSOR->RHS->TIMER_START->ELAPSED_TIME->WTIME
Routine: APPLU->SSOR->RHS->TIMER_START->ELAPSED_TIME
Routine: APPLU->SSOR->RHS->TIMER_START
Routine: APPLU->SSOR->RHS
Routine: APPLU->SSOR->RHS->TIMER_STOP
Routine: APPLU->SSOR->RHS
Routine: APPLU->SSOR
Routine: APPLU->SSOR->L2NORM
INFO! Array reduction variable replaced with local critical in region 1 -
        SUM() --> SUM_CAP1()
Routine: APPLU->SSOR
Routine: APPLU->SSOR->JACLD
Routine: APPLU->SSOR
Routine: APPLU->SSOR->BLTS
Routine: APPLU->SSOR
WARNING! Potential memory conflict for shared variable in region <2,1> - ELAPSED
Routine: APPLU->SSOR->JACU
Routine: APPLU->SSOR
Routine: APPLU->SSOR->BUTS
Routine: APPLU->SSOR
WARNING! Potential memory conflict for shared variable in region <3,1> - ELAPSED
Routine: APPLU
Routine: APPLU->ERROR
INFO! Array reduction variable replaced with local critical in region 1 -
        ERRNM() --> ERRNM_CAP1()
Routine: APPLU
Routine: APPLU->PINTGR
Routine: APPLU
Routine: APPLU->VERIFY
Routine: APPLU
WARNINGs for "...variable used after a parallel region" and "potential memory conflict", and INFOs on the changes made to routine arguments, should be examined carefully. These are only warnings and may or may not indicate programming errors. They mark the cases where CAPO is uncertain of its decision, and the user needs to inspect the generated code at the indicated places for verification. The parallel region is identified by a <region_number, parallel_loop_number> pair and belongs to the call path printed immediately before the warning message.
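
When a log is long, it can help to pair each warning with the call path printed just before it. The following Python sketch does this for lines shaped like the sample above. The function name and the returned tuple layout are assumptions of this example, and the line prefixes are taken only from the excerpt shown here, not from a specification of the log format.

```python
import re

# Matches the <region_number, parallel_loop_number> label, e.g. <2,1>.
REGION = re.compile(r"<(\d+),\s*(\d+)>")


def warnings_with_paths(log_lines):
    """Pair each WARNING line with the most recent 'Routine:' call path."""
    current_path = None
    found = []
    for raw in log_lines:
        line = raw.strip()
        if line.startswith("Routine:"):
            current_path = line[len("Routine:"):].strip()
        elif line.startswith("WARNING!"):
            m = REGION.search(line)
            region = (int(m.group(1)), int(m.group(2))) if m else None
            found.append((current_path, region, line))
    return found


sample = [
    "Routine: APPLU->SSOR->BLTS",
    "Routine: APPLU->SSOR",
    "WARNING! Potential memory conflict for shared variable in region <2,1> - ELAPSED",
    "Routine: APPLU->SSOR->JACU",
]
for path, region, text in warnings_with_paths(sample):
    print(path, region, text)
```

For the sample lines, the warning is attributed to the path APPLU->SSOR with region label (2, 1), which is where the generated code should be inspected.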

Meanings of keywords in the WARNING message:

   "variable"        -- a variable used in the current routine scope
   "common-variable" -- a variable used outside the current scope
                        e.g. through COMMON blocks or SAVE statements
                        in a subroutine
   "Shared"          -- variable shared in the current region
   "PLocal"          -- potential private variable in the current region
   "Control"         -- variable with multiple control paths, i.e. variable
                        could be updated either inside or outside the
                        current region
   "I/O statement"   -- routine called inside a parallel region
                        contains i/o (OPEN,READ,WRITE,CLOSE) statements
   "STOP statement"  -- routine called inside a parallel region
                        contains STOP/PAUSE statements
   "Potential memory conflict" -- for shared variable that can cause
                        memory conflict in a parallel region
If a private variable in a parallel region is updated via a COMMON block in a subroutine, CAPO tries to privatize such a variable by adding it to the subroutine's argument list and renaming the original variable in the COMMON block of the subroutine. CAPO will generate the following INFO messages in this process:
   New argument () added to CALL OTHER_ROUTINE():# in ROUTINE_NAME
   New symbol () added to the argument list of ROUTINE_NAME
   Common block /cblk/ duplicated for ROUTINE_NAME
CAPO automatically performs a code transformation for a reduction variable that is an array element. The corresponding message looks like:
   Array reduction variable replaced with scalar in region # -
      OLD_ARRAY_ELEMENT --> NEW_SCALAR_VARIABLE
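
The sample log earlier in this section also reports "Array reduction variable replaced with local critical" (for example SUM() --> SUM_CAP1()). These notes do not spell out the generated code. One plausible reading, assumed here, is that each thread accumulates into a private copy of the array and merges it into the shared array inside a critical section. The Python sketch below illustrates that general pattern with threads and a lock. It is not CAPO output (CAPO emits Fortran with directives), and all names in it are hypothetical.

```python
import threading


def parallel_array_sum(rows, n_threads=4):
    """Element-wise sum of non-empty 'rows' using private copies plus a lock."""
    width = len(rows[0])
    total = [0.0] * width      # shared reduction array (plays the role of SUM)
    lock = threading.Lock()    # plays the role of a critical section

    def worker(chunk):
        local = [0.0] * width  # per-thread private copy (like SUM_CAP1)
        for row in chunk:
            for j, value in enumerate(row):
                local[j] += value
        with lock:             # merge once per thread, one thread at a time
            for j in range(width):
                total[j] += local[j]

    threads = [
        threading.Thread(target=worker, args=(rows[i::n_threads],))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


print(parallel_array_sum([[1, 2], [3, 4], [5, 6]]))
```

The point of the pattern is that the shared array is touched only once per thread, inside the protected merge, rather than once per loop iteration.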

5.4 Debug information

More information is logged if CAPO_LOGINFO is set to "debug". This information is useful for debugging CAPO. Some of it is described here for reference only.
 - UserLoop information for user-defined loop types
   Userloop: Defined loop # in routine ROUTINENAME - newtype
   The newtype is one of (S, P, R, B), as mentioned in Section 4.3

 - List of old directives to be cleared

 - Summary of loop type with list of all dependence vector deltas for
   pipeline loops

 - List of symbols and types in each region

   TYPE
      Private            - Local symbol
      Reduction          - Scalar reduction variable
      ArrayReduction     - Array reduction variable
      Shared             - Shared symbol
      LastPrivate        - Used in and after the region
      FirstPrivate       - Used in and before the region
      CopyInOut          - Shared, but with no (or no proven) loop-variable
                           dependence
      ThreadPrivate      - Used in a threadprivate common block
      UnknownType        - Type not defined yet

   CONTROL
      No-Control         Symbol not in a control dependence
      Control-Dep        Symbol in a control dependence

   SCOPE
      In-Scope           Symbol defined in the current routine
      Not-in-Scope       Symbol not defined in the current routine
                         (defined via common block or save statement)
      Not-in-Use         Symbol passed into a subroutine but not used
                         in the subroutine

   DTYPE:DEPTH (printed in [.:.])
      IO      -1, Input/Output
      NT       0, Non-exact True
      NA       1, Non-exact Anti
      NO       2, Non-exact Output
      ET       3, Exact True
      EA       4, Exact Anti
      EO       5, Exact Output
      CT       6, Control
      UN       7, Unknown type
      Depth = 0 for loop-independent dependence

 - List of routine call types, indicating the usage of a routine 
   inside/outside parallel regions/loops.  Five bits are used:
      bit1 [0x01] called outside parallel region
      bit2 [0x02] called inside parallel region but outside parallel loop
      bit3 [0x04] called inside parallel loop
      bit4 [0x08] called outside parallel loop (= bit1 | bit2)
      bit5 [0x10] called inside parallel region

 - Information on updating duplicated routines
      Replace call to DROUTINE with CAP_DROUTINE in ROUTINE
      Removed ROUTINE from the calledby list of DROUTINE
      Added   ROUTINE to the calledby list of CAP_DROUTINE

 - List of symbols and affine expressions for testing loop limits
   (such as in the removal of end-of-loop synchronizations)

      HOME (LOOP-VAR-EXPR, #hits) Low <EXPR> High <EXPR> [A1:INDX,A2:INDX..]
           (LOOP-VAR-EXPR, #hits) Low <EXPR> High <EXPR> [B1:INDX,B2:INDX..]
      OTHER (NONLOOP-EXPR, #hits) [C1:INDX,C2:INDX..]
            (NONLOOP-EXPR, #hits) [D1:INDX,D2:INDX..]
      Here <EXPR> is a symbolic expression, A,B,C,D are array names, INDX is 
      the relevant array index.  The lists are for both source and sink.

 - Summary of fields associated with the ploopinfo data structure, mainly
   for development purposes.

       Loop   Lvar    D/L     Type    G WP IP Flag
      Routine: ROUTINE_NAME
          #   var     ?/?     TYPE?   ?  ?  ? [321]

      'Loop' -- the loop number in a routine
      'Lvar' -- the loop variable name
      'D'    -- the 'dlevel' value
      'L'    -- the 'level' value of the loop
      'Type' -- one of the type strings given in Section 5.1
      'G'    -- the loop granularity flag (internal info only)
      'WP'   -- '1' containing parallel loop, '0' without parallel loop
      'IP'   -- '1' inside parallel loop, '0' not inside parallel loop
      'Flag' -- three bits for internal usage only

 - Symbols and their types in common blocks (for testing threadprivate)
   Meanings of symbol types:
      [U] - Unset
      [P] - Private
      [R] - Reduction
      [A] - ArrayReduction
      [S] - Shared (RW)
      [s] - Shared (Readonly)
      [L] - LastPrivate
      [F] - FirstPrivate
      [C] - CopyInOut
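
As an aid for reading the routine call types listed earlier in this section, the five bits can be decoded with a small lookup. This Python sketch only restates the bit table above. The function name is hypothetical, and it assumes the value has already been read from the log as an integer, since these notes do not say how the value is printed.

```python
# Bit masks and meanings copied from the routine call type list.
CALL_TYPE_BITS = [
    (0x01, "called outside parallel region"),
    (0x02, "called inside parallel region but outside parallel loop"),
    (0x04, "called inside parallel loop"),
    (0x08, "called outside parallel loop"),
    (0x10, "called inside parallel region"),
]


def decode_call_type(value):
    """Return the descriptions of all call-type bits set in 'value'."""
    return [text for mask, text in CALL_TYPE_BITS if value & mask]


# Example: bits 0x02, 0x08 and 0x10 set together.
for text in decode_call_type(0x1A):
    print(text)
```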