Asynchronous and event-driven PyOpenCL programming¶
by Tomasz Rybak
- tomasz.rybak@post.pl
- Debian Maintainer of PyOpenCL and PyCUDA
- Currently working at CodiLime
- Worked at University of Geneva
Description¶
OpenCL is a library, API, and programming language intended to help with performing computations on different computing devices such as ordinary CPUs, graphics cards (GPUs), specialized chips, or FPGAs. OpenCL provides different profiles offering various capabilities (e.g. kernel compilation at runtime, executing native binary code, embedded function libraries) to support different device types. Programming GPUs in Python is easy thanks to PyOpenCL (and PyCUDA). Not everything offered by OpenCL can be used in Python though, because OpenCL is defined assuming usage of the C language. Some functionality, like calling a function in response to an event, requires providing a pointer to a C function; fortunately such requirements show up only in the most sophisticated use cases.
PyOpenCL helps with achieving high performance through asynchronous, event-driven programming by allowing us to use many queues and many devices and to mix synchronous and asynchronous calls. We can create quite a sophisticated computation workflow and OpenCL will try to use the available hardware, e.g. by running code and transferring data at the same time. New OpenCL versions allow splitting one physical device into many logical ones (fission), which can be used to reserve some computing capability for time-sensitive work. We can also attach many devices to one shared context, which allows writing code that performs different tasks and computations in parallel.
Some of the features offered by PyOpenCL are similar to those present in Python. Performing asynchronous computations on a GPUArray and retrieving results later is similar to Python’s Futures. So far it is impossible to retrieve Futures from a GPUArray (to integrate GPU and CPU computing), but this seems to be a case of missing functionality, not an incompatibility preventing it from happening.
I want to show that creating programs performing quite sophisticated computations can be easy thanks to Python and PyOpenCL. I would also like to start a discussion about current PyOpenCL limitations and to get feedback from PyOpenCL users.
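The comparison with Python’s Futures in the description can be made concrete with a small sketch. It is not part of PyOpenCL; `future_from_event`, `fetch_result`, and `StubEvent` are hypothetical names introduced here. The only assumption about the event is that it offers a blocking `wait()` method, as `pyopencl.Event` does; the stub stands in for a real device event so the sketch runs without any OpenCL driver.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def future_from_event(event, fetch_result, executor):
    """Return a Future that resolves after `event.wait()` returns.

    `event` is any object with a blocking wait() method (assumed to
    behave like pyopencl.Event); `fetch_result` is a callable that
    produces the value once the event has completed.
    """
    def wait_and_fetch():
        event.wait()           # blocks a worker thread, not the caller
        return fetch_result()

    return executor.submit(wait_and_fetch)


class StubEvent:
    """Hypothetical stand-in for a device event: completes after a delay."""

    def wait(self):
        time.sleep(0.05)


with ThreadPoolExecutor(max_workers=1) as pool:
    future = future_from_event(StubEvent(), lambda: [2.0, 3.0], pool)
    # The caller is free to do other work here.
    result = future.result()   # blocks only when the value is needed
    print(result)
```

One worker thread per pending event is wasteful; this only illustrates how an event could be exposed through the Futures interface.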
Increasing hardware parallelism¶
- Moore’s law, increasing transistor density
- Power wall
- Chip’s frequency doesn’t increase anymore
- We get more cores instead
- No more automatic performance improvements
- Different programming models
- OpenCL has emerged as a standard intended to help programmers get past this obstacle.
Summary: Use OpenCL to access the power of graphics cards as math processors
OpenCL¶
Standard maintained by Khronos
Similar to OpenGL
- Extensions
- Different models for different devices
Compiled or binary kernels run on cores separate from the CPU
Based on C
Includes events and asynchronous execution
Information
Basic OpenCL programming model¶
Execution units hierarchy
- Hosts
- Platforms
- Computing devices
- Computing units
- Processing elements
Memory hierarchy
- Global memory
- Constant memory
- Local memory
- Private memory
Relaxed consistency of memory access
Cache
Execution run-time hierarchy¶
Context
Queue
Work-group
- A bunch of threads go into a work group
- Which means you can have 100 threads run in a group, or 1000.
Work-item
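The relation between work-items and work-groups comes down to simple arithmetic. The sketch below uses hypothetical helper names and assumes the OpenCL 1.x rule that, when a work-group size is given explicitly, the total number of work-items must be a multiple of it. A common way to satisfy that rule is to pad the global size upward and let the kernel ignore the extra work-items.

```python
def padded_global_size(n_items, group_size):
    """Smallest multiple of group_size that is >= n_items."""
    if n_items < 0 or group_size <= 0:
        raise ValueError("n_items must be >= 0 and group_size > 0")
    # Ceiling division written with integer arithmetic only.
    return -(-n_items // group_size) * group_size


def work_group_count(n_items, group_size):
    """Number of work-groups launched for the padded global size."""
    return padded_global_size(n_items, group_size) // group_size


# 1000 work-items in groups of 100 fit exactly into 10 groups.
print(padded_global_size(1000, 100), work_group_count(1000, 100))
# Groups of 128 need padding to 1024 work-items, i.e. 8 groups.
print(padded_global_size(1000, 128), work_group_count(1000, 128))
```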
Execution Models¶
Task parallelism
- One thread running computations
- Possibility of running many threads at the same time
- Requires an out-of-order queue or many queues
Computation parallelism
- Many
TODO - Get the parts I missed
PyOpenCL¶
… and PyCUDA
Python wrapper for OpenCL
Not only wrapper
- Pythonic
- Object oriented
Stable but still work in progress
- extensions
- high level programming
OpenCL programming workflow¶
- Compile kernels
- Prepare data
- Transfer data to device
- Run computations
- After finishing computations get results from device
- Free resources
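The workflow steps above can be sketched as one function. The kernel and the name `increase_on_device` are illustrative assumptions, not code from the talk. The imports are placed inside the function only so that the snippet loads on machines without PyOpenCL; calling it needs NumPy, PyOpenCL, and a working OpenCL driver.

```python
KERNEL_SOURCE = """
__kernel void increase(__global float *a)
{
    int i = get_global_id(0);
    a[i] = a[i] + 1.0f;
}
"""


def increase_on_device(values):
    """Add 1 to every element of `values` on an OpenCL device."""
    # Imported lazily so the module loads even where PyOpenCL is absent.
    import numpy as np
    import pyopencl as cl

    context = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(context)

    # 1. Compile kernels.
    program = cl.Program(context, KERNEL_SOURCE).build()

    # 2. Prepare data on the host.
    a = np.array(values, dtype=np.float32)

    # 3. Transfer data to the device.
    a_gpu = cl.Buffer(context, cl.mem_flags.READ_WRITE, size=a.nbytes)
    cl.enqueue_copy(queue, a_gpu, a)

    # 4. Run computations, one work-item per element.
    program.increase(queue, a.shape, None, a_gpu)

    # 5. Get results from the device (host copies block by default).
    cl.enqueue_copy(queue, a, a_gpu)

    # 6. Free resources explicitly instead of waiting for garbage collection.
    a_gpu.release()
    return a
```

On a machine with an OpenCL driver, `increase_on_device([1.0, 2.0, 3.0])` is expected to return an array holding 2.0, 3.0, and 4.0.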
Event based programming done in Python¶
- Instruct OpenCL to run computations
- Don’t wait for data
- Computation will get to you when it’s done
event = pyopencl.enqueue_copy(queue, a, a_gpu)
event.wait()
event = program.increase(queue, a.shape, None, a_gpu)
# later code
queue0 = pyopencl.CommandQueue(context)
queue1 = pyopencl.CommandQueue(context)
event = pyopencl.enqueue_copy(queue)
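A fuller version of the fragments above might look like the following sketch, which chains commands on two queues through events instead of blocking after every call. The kernel, the function name, and the choice of which queue carries which command are assumptions made for illustration. Imports are placed inside the function only so that the snippet loads without PyOpenCL; calling it needs NumPy, PyOpenCL, and an OpenCL driver.

```python
KERNEL_SOURCE = """
__kernel void increase(__global float *a)
{
    int i = get_global_id(0);
    a[i] = a[i] + 1.0f;
}
"""


def increase_with_events(values):
    """Upload on one queue, compute and download on another."""
    # Imported lazily so the module loads even where PyOpenCL is absent.
    import numpy as np
    import pyopencl as cl

    context = cl.create_some_context(interactive=False)
    queue0 = cl.CommandQueue(context)
    queue1 = cl.CommandQueue(context)
    program = cl.Program(context, KERNEL_SOURCE).build()

    a = np.array(values, dtype=np.float32)
    result = np.empty_like(a)
    a_gpu = cl.Buffer(context, cl.mem_flags.READ_WRITE, size=a.nbytes)

    # Non-blocking upload: returns an event immediately.
    upload = cl.enqueue_copy(queue0, a_gpu, a, is_blocking=False)
    # Submit queue0's commands before another queue waits on its event.
    queue0.flush()

    # The kernel starts only after the upload has finished.
    compute = program.increase(queue1, a.shape, None, a_gpu,
                               wait_for=[upload])
    # The download waits for the kernel; the dependency is spelled out
    # even though an in-order queue would preserve the order anyway.
    download = cl.enqueue_copy(queue1, result, a_gpu,
                               is_blocking=False, wait_for=[compute])

    # The host could do unrelated work here.
    download.wait()
    return result
```

The only blocking call is the final `download.wait()`; everything before it just records dependencies and returns.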
Fission¶
- Splitting one physical device into many logical ones.
- Can be used to reserve some computational power
- Solution similar to CPU virtualization
- No problems with device-to-device memory transfers
Where PyOpenCL helps¶
Provides:
- Array
- Random number generators
- Single pass element-wise expressions
- Reduction
- Parallel scan
Designed so you aren’t writing C code from scratch all the time to make your computations run fast on graphics cards.
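As a rough illustration of these helpers, the sketch below uses `pyopencl.array` for an element-wise expression followed by a reduction, with no hand-written kernel. The function name `scaled_sum` is a hypothetical example, and the imports are placed inside the function only so that the snippet loads without PyOpenCL; calling it needs NumPy, PyOpenCL, and an OpenCL driver.

```python
def scaled_sum(values, factor=2.0):
    """Multiply every element by `factor` on the device and sum the result."""
    # Imported lazily so the module loads even where PyOpenCL is absent.
    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array

    context = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(context)

    # Array: device-side counterpart of a NumPy array.
    a_gpu = cl_array.to_device(queue, np.array(values, dtype=np.float32))

    # Element-wise expression: PyOpenCL generates the kernel.
    scaled = factor * a_gpu

    # Reduction: the result is a device scalar until get() copies it back.
    total = cl_array.sum(scaled)
    return float(total.get())
```

With a working driver, `scaled_sum([1.0, 2.0, 3.0])` is expected to give 12.0.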
Extensions¶
All extensions require pointers to C functions, so it’s tricky to make them work
OpenGL¶
Can share data between OpenCL and OpenGL
Future of PyOpenCL¶
Intention to share code between PyOpenCL and PyCUDA
Increase the number of 3rd-party libraries
Some of those could be added to PyOpenCL
Resolving existing problems
- Adding extensions should be easier
- Supporting additional libraries
Suggestions¶
Check out http://copperhead.github.com as a way to wrap PyCUDA for easier coding.