Python is a dynamically typed, interpreted scripting language. Out of the box, this can be slower, but that is easy to overcome. You can do more with less code, programming in high level languages is faster. Since the installation of the interpretor handles machineand operating system dependencies and quirks, you can write code once that will run anywhere. The computational bits can also be refactored into lower level languages like Fortran and C.
Numba does just-in-time (JIT) compiling on Python code to make it as fast as native C in many cases. This approach is good for pieces of code that run repetitively.
There a several ways to accelerate code. This accomplished through parallelization in shared or distributed memory setting. Graphical processing units (GPUs) can accelerate jobs further. At the application layer, methods are encapsulated so that they can be executed as a batch of jobs across the processors on one or more machines
Python uses processor parallelism instead of thread parallelism. When processes need to communicate, they use Message Passing Interface (MPI). Use the common C Pi example to see if MPI is installed:
mpicc call-procs.c -o call-procs -lm mpicc C_PI.c -o C_PI -lm mpicc MPI_08_b.c -o MPI_08_b.c -lm mpiexec -n 1 call-procs
This will need to be working before Python bindings to MPI will work.
When parallelism expands outside of a single machine or hsared-memory space, we start calling it distributed computing. For Python, this is a better description than I can come up with off the cuff. There are two approaches to distributed computing: centralized and decentralized. Centralized is simpler, and will be discussed here. For ecological simulations, there will generally be some kind of controller that schedules the various tasks needed during each time step.
Say you are calculating particle trajectories, and have a NetCDF output from FVCOM, but it doesn’t include dispersal terms for whatever reason. The initial process is a job server/manager, which distributes work across machines and processes. The manager would do mesh partitioning, and hand off topology to the other machines. Time loops are handled by the server processes, which use asynchronous communication. Each job is for a single node in a mesh. The processes take initialization info, then wait for a forcing message. They take a integration step for each message, and report back the state variables.
For each time step, the machines would request a JSON of physical fields from the manager, and load it into memory. Then divide node/volume calculations among available processes. Once all the field calculations are complete, then you’d setup a particle queue, and have multiple processors do interpolation and advection until all particles are in the update queue. A particle leaving the local domain would go to a transfer queue, to be sent to another machine. Basically, the serial program needs to be broken down into a process manager, and self-contained kernels that have limited dependencies.
For program “Happiest particle”, if we assume that a particle can never jump onto land, the procedures might be something like,
def A(array): """Make some calculations across all nodes""" result = fcn(array) return result def B(position, topology): """Get interpolated value at position""" result =  for node in topology: result[node] = fcn(array, topology) return result def C(positions, velocities, stage): """Runge-kutta integration""" positions += velocities * fcn(stage) def D(array): """find maximum"""
The psuedocode for the simulation loop would look something like,
Setup() While(True): Map A(A) > processes collect results Map B > machines > processes collect results for step in integration: Map C > machines > processes collect results transfer particles Map B > machines > processes collect results Map D > processes > reduce > alert if (condition): print("All particles are ecstatic!") break Output() Exit()
What you really want is a Docker swarm to manage all that for you. Or more likely, Kubernetes. The idea being there is one “server” per processor, which acts as an addressable virtual machine. This does away with machine/processor distinction.
Job stream can also distribute jobs across hybrid clusters:
CPLUS_INCLUDE_PATH=~/berryconda3/envs/oceanics/bin/boost pip\ install job_stream LD_LIBRARY_PATH=~/my/path/to/boost/stage/lib/ \ pip install job_stream
Job stream uses the Boost library, which can cause some installation challenges.
Another option is Ray:
sudo apt-get install -y cmake pkg-config build-essential autoconf curl libtool unzip flex bison python pip install cython conda install libgcc pip install git+https://github.com/ray-project/ray.git#subdirectory=python
The GPU can act as an additional worker. Programming for the GPU is accomplished with compute shaders, which replace the traditional rendering pipeline.
There on instructions online for optimizing the Videocore GPU in Raspberry Pi, but it’s probably not worth the effort.
Arrayfire is an open source optimizer for CUDA and OpenCL on GPUs and CPUs, that has extensions for Fortran and Python. On MacOS, there is an installer package. After this has been dowloaded and executed, the Python package can be installed with:
pip install arrayfire line_profiler memory_profiler objgraph numba
The basic array creation functions and types are in arrayfire.data.