---------------------------------- CASITA 1.9 ----------------------------------

MPI
- do not allow to assign more waiting time to MPI_Wait[all] than its duration

OpenMP
- support different number of threads in different OpenMP regions
- fixed detection of OpenMP directives (ignore file name)

Offloading
- added rule to blame the host for not keeping the device busy
- refined consecutive transfer detection

MISC
- distinguish between different types/reasons of blame
- improved implementation of blame distribution (only use edges to store blame)
- avoid writing of zero values for the OTF2 blame counter
- write blame and waiting time in seconds instead of ticks
- simplified writing of the critical path
- added units for critical path, waiting time, and blame in OTF2 output
- improved rating output for long region names
- the rating output distinguishes between host and device functions
- corrected and beautified pattern summary output
- generate summary file
- added Score-P user instrumentation to analyze CASITA's performance
- reduced memory footprint by removing members from Node and Edge classes
- improved merging of activity groups (program regions)


---------------------------------- CASITA 1.8 ----------------------------------

MPI
- added support for custom MPI communicators
- fixed MPI_Sendrecv rule
- start local critical path analysis during waiting time in the detection of
  critical MPI nodes

Offloading (CUDA/OpenCL)
- share analysis rules between CUDA and OpenCL (avoid code duplication)
- optionally write device idle and/or compute idle events to the output trace
- added dependencies between non-overlapping kernels on different device streams
  (improved critical path detection on a device)
- avoid intermediate analysis start when OpenMP parallel regions or offload
  kernels are active
- added handling of CUDA null stream

MISC
- added summary output on inefficiency patterns for MPI, OpenMP, and offloading
- reduced CASITA's memory overhead (less edges are created, different stream
  objects for MPI, OpenMP, and device streams, etc.)
- added function filter for trace output
- improved verbose output


---------------------------------- CASITA 1.7 ----------------------------------

MPI
- try to ignore MPI_I[send|recv] regions without communication information 
  (CASITA aborted before)
- fixed blame distribution in MPI_Recv rule
- added check for send-receive overlap in MPI_Send rule

OpenMP
- added critical path detection and blame distribution for OMPT-based OpenMP 
  traces (Score-P branch TRY_TUD_mic_offloading)
- use OTF2 location definitions to detect whether the trace contains OpenMP 
  events

CUDA
- improved handling of CUDA events and CUDA kernels over interval boundaries
- fixed handling of late CUDA synchronization
- corrected removal of kernel launch nodes at interval boundaries
- fused BlameKernelRule, BlameSyncRule, and LateSyncRule
- avoid CUDA analysis if input trace does not meet the requirements 
  (CUDA stream references)
- add CUDA event rules only if cuEventRecord occurs in the trace

OpenCL
- avoid OpenCL analysis if input trace does not meet the requirements 
  (OpenCL queue references); previous check was broken

MISC
- activity impact rating is now based on the new metric "Blame on CP"
- removed unused code
- added option to detect circular loops in process-local critical path analysis


---------------------------------- CASITA 1.6 ----------------------------------

MPI
- added late sender/receiver rules for wait states in MPI_Wait and MPI_Waitall 
  (generate graph nodes for enter events of non-blocking MPI routines again)
- added blame distribution for MPI_Wait and MPI_Waitall later sender/receiver 
  pattern
- handle MPI_Test and MPI_Testall to close MPI requests and remove respective 
  internal data
- simplified critical path detection for MPI (reduced memory and communication 
  overhead)
- fixed detection of the last MPI node in an interval
- use collective rule for all collectives (one-to-all, all-to-one, barrier, etc)


---------------------------------- CASITA 1.5 ----------------------------------

MPI
- handle enter events of non-blocking MPI routines as CPU events to reduce 
  memory overhead
- fixed the detection of the MPI root process for one-to-all communication
- handle MPI all-to-one communication with MPI collective rule instead of the 
  broken one-to-all rule

OpenMP
- fixed critical path detection for OpenMP programs
- reduce CASITA memory footprint by handling OpenMP events other than parallel 
  and fork/join as CPU events
- prepared OpenMP support for nested OpenMP parallel regions

CUDA
- avoid deletion of CUDA nodes at the end of analysis intervals, if nodes are 
  potentially used afterwards

OpenCL
- added support for OpenCL synchronization routines

Misc
- fixed blame distribution for all supported paradigms
- dramatically reduced runtime of local critical path detection
- added command line option to replace CASITA output files
- improved debugging output and documentation
- removed revision number in CASITA versioning


--------------------------------- CASITA 1.4.1 ---------------------------------

- corrected computation and visualization of the begin and end of the critical 
  path
- added flag to specify the number of top rated functions
- fixed several bugs in the MPI critical path analysis output
- boost is not required any more
- added csv output for optimization rating
