In this article, we discuss how the fast multipole method (FMM) can be implemented on modern parallel computers, ranging from computer clusters to multicore processors and graphics cards (GPU). The FMM is a somewhat difficult application for parallel computing because of its tree structure and the fact that it requires many complex operations which are not regularly structured. Computational linear algebra with dense matrices for example allows many optimizations that leverage the regular computation pattern. FMM can be similarly optimized but we will see that the complexity of the optimization steps is greater. The discussion will start with a general presentation of FMMs. We briefly discuss parallel methods for the FMM, such as building the FMM tree in parallel, and reducing communication during the FMM procedure. Finally, we will focus on porting and optimizing the FMM on GPUs.
Dans cet article, nous présentons l'implémentation de la méthode multipôle rapide (FMM) sur des calculateurs parallèles modernes, depuis les grappes parallèles aux processeurs multi-cœurs et aux cartes graphiques (GPU). La FMM est une application difficile à paralléliser à cause de la structure d'arbre et le fait qu'elle demande des opérations complexes qui ne sont pas structurées de façon régulière. L'algèbre linéaire computationnelle avec des matrices denses par exemple permet des optimisations qui utilisent les motifs réguliers de calcul. La FMM peut être optimisée de façon similaire mais nous verrons que la complexité de l'optimisation est supérieure. La discussion débute par une présentation générale de la FMM. On discutera brièvement des méthodes parallèles pour la FMM, comme la construction de l'arbre, et la réduction des communications durant le calcul. Finalement, nous présenterons plus en détail le développement et l'optimisation de la FMM sur GPUs.
Mots-clés : Informatique, Méthode multipôle rapide, Calculateur parallèle
Eric Darve 1; Cris Cecka 1; Toru Takahashi 2
@article{CRMECA_2011__339_2-3_185_0, author = {Eric Darve and Cris Cecka and Toru Takahashi}, title = {The fast multipole method on parallel clusters, multicore processors, and graphics processing units}, journal = {Comptes Rendus. M\'ecanique}, pages = {185--193}, publisher = {Elsevier}, volume = {339}, number = {2-3}, year = {2011}, doi = {10.1016/j.crme.2010.12.005}, language = {en}, }
TY - JOUR AU - Eric Darve AU - Cris Cecka AU - Toru Takahashi TI - The fast multipole method on parallel clusters, multicore processors, and graphics processing units JO - Comptes Rendus. Mécanique PY - 2011 SP - 185 EP - 193 VL - 339 IS - 2-3 PB - Elsevier DO - 10.1016/j.crme.2010.12.005 LA - en ID - CRMECA_2011__339_2-3_185_0 ER -
%0 Journal Article %A Eric Darve %A Cris Cecka %A Toru Takahashi %T The fast multipole method on parallel clusters, multicore processors, and graphics processing units %J Comptes Rendus. Mécanique %D 2011 %P 185-193 %V 339 %N 2-3 %I Elsevier %R 10.1016/j.crme.2010.12.005 %G en %F CRMECA_2011__339_2-3_185_0
Eric Darve; Cris Cecka; Toru Takahashi. The fast multipole method on parallel clusters, multicore processors, and graphics processing units. Comptes Rendus. Mécanique, High Performance Computing, Volume 339 (2011) no. 2-3, pp. 185-193. doi : 10.1016/j.crme.2010.12.005. https://comptes-rendus.academie-sciences.fr/mecanique/articles/10.1016/j.crme.2010.12.005/
[1] The black-box fast multipole method, Journal of Computational Physics, Volume 228 (2009) no. 23, pp. 8712-8725
[2] Rapid solution of integral equations of classical potential theory, Journal of Computational Physics, Volume 60 (1985) no. 2, pp. 187-207
[3] A fast algorithm for particle simulations, Journal of Computational Physics, Volume 73 (1987) no. 2, pp. 325-348
[4] A massively parallel adaptive fast-multipole method on heterogeneous architectures, Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ACM, 2009, pp. 1-12
[5] F.A. Cruz, M.G. Knepley, L.A. Barba, PetFMM – A dynamically load-balancing parallel fast multipole library.
[6] A parallel hashed oct-tree n-body algorithm, Proceedings of the 1993 ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis, ACM, 1993, p. 21
[7] G.M. Morton, A computer oriented geodetic data base and a new technique in file sequencing, International Business Machines Co., 1966.
[8] Bottom-up construction and 2:1 balance refinement of linear octrees in parallel, SIAM Journal on Scientific Computing, Volume 30 (2008) no. 5, pp. 2675-2708
[9] Diagnosis, tuning, and redesign for multicore performance: A case study of the fast multipole method, Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ACM, 2010
[10] Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), IEEE, 2010, pp. 1-12
[11] A kernel-independent adaptive fast multipole algorithm in two and three dimensions, Journal of Computational Physics, Volume 196 (2004) no. 2, pp. 591-626
[12] Massively parallel implementation of a fast multipole method for distributed memory machines, Journal of Parallel and Distributed Computing, Volume 65 (2005) no. 7, p. 881
[13] Scalable and portable implementation of the fast multipole method on parallel computers 1, Computer Physics Communications, Volume 153 (2003) no. 3, pp. 445-461
[14] Performance of a parallel implementation of the FMM for electromagnetics applications, International Journal for Numerical Methods in Fluids, Volume 43 (2003) no. 8, pp. 865-879
[15] Parallel multilevel fast multipole method for solving large-scale problems, IEEE Antennas and Propagation Magazine, Volume 47 (2005) no. 4, p. 111
[16] NVIDIA Corporation, NVIDIA CUDA programming guide 3.0, online at www.nvidia.com, Feb 2010.
[17] C. Cecka, A.J. Lew, E. Darve, Assembly of finite element methods on graphics processors, Intl. J. Numerical Methods in Engineering (2009).
[18] Introduction to assembly of finite element methods on graphics processors, IOP Conference Series: Materials Science and Engineering, vol. 10, IOP Publishing, 2010, p. 012009
[19] Biomolecular electrostatics simulation by an FMM-based BEM on 512 GPUs, 2010 (Arxiv preprint) | arXiv
[20] 42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence, Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ACM, 2009, pp. 1-12
[21] Fast multipole methods on a cluster of GPUs for the meshless simulation of turbulence, Computer Physics Communications, Volume 180 (2009) no. 11, pp. 2066-2078
[22] The study of colliding vortex rings using a special-purpose computer and FMM, Transactions of the Japan Society for Computational Engineering and Science, Volume 3 (2008), p. 13
[23] Multilevel fast multipole algorithm enhanced by GPU parallel technique for electromagnetic scattering problems, Microwave and Optical Technology Letters, Volume 52 ( Mar 2010 ) no. 3, pp. 502-507
[24] Low-frequency MLFMA on graphics processors, IEEE Antennas and Wireless Propagation Letters, Volume 9 (2010), pp. 8-11
[25] T. Takahashi, C. Cecka, E. Darve, An implementation of low-frequency fast multipole BIEM for Helmholtz' equation on GPU, in: Proceedings of JSME 23rd Computational Mechanics Conference (CD-ROM), Kitami, Hokkaido, Japan, JSME, September 2010.
Cited by Sources:
Comments - Policy