Sammendrag
The maximum flow problem is a core graph problem and algorithms for solving this problem has evolved
since Ford and Fulkerson published their solution in 1956. It has applications in several graph flow
problems, graph cuts, scheduling, and many more. These applications led to the development of preflowpush
algorithms. These have several variations with different complexities, and are amenable to
parallelization. Early efforts in parallelizing preflow-push algorithms needed locks to prevent errors, while
recent research has led to lock-free algorithms that are much faster.
Multi-GPU systems are becoming more common and powerful. This has allowed the research into the
usage and programming of such systems to increase. We introduce what is to our knowledge the first
implementation of a maximum flow algorithm targeting multiple GPUs, and identify and discuss the
suitability of implementing maximum flow on multiple GPUs.
Our implementation is based on the asynchronous multithreaded algorithm by Hong and He, and we use
the asynchronous multi-GPU framework Groute by Ben-Nun, Sutton, Pai, and Pingali as a base. We use
the distributed worklist supplied by Groute to manage our list of active items, and CUDA managed memory
for host to device, and device to device communication. The implementation is benchmarked on an NVIDIA
DGX-2 machine using genrmf graphs which are commonly used for benchmarking max-flow algorithms.
Our multi-GPU version may solve bigger problems compared to single GPU implementations due to the
increased memory capacity. Even with this increase in memory capacity and the large amount of available
threads it does not beat a simple single threaded CPU implementation. The implementation may also
produce wrong results. We propose ways to combat both the run time issues, and the correctness issues.
We nevertheless managed to develop and implement a multi-GPU max flow algorithm which ran correctly on most of the benchmarks we tested. Our results illustrate how challenging implementing irregular
algorithms efficiently on multiple GPUs.
Vis fullstendig beskrivelse