The fast multipole method on parallel clusters, multicore processors, and graphics processing units

Eric Darve; Cris Cecka; Toru Takahashi

doi:10.1016/j.crme.2010.12.005

The fast multipole method on parallel clusters, multicore processors, and graphics processing units
[La méthode multipôle rapide sur des grappes d'ordinateurs en parallèle, des processeurs multi-cœurs, et des processeurs graphiques]

Eric Darve ¹ ; Cris Cecka ¹ ; Toru Takahashi ²

¹ Mechanical Engineering Department, Institute for Computational and Mathematical Engineering, Stanford University, Durand 209, 496 Lomita Mall, 94305-3030 Stanford, CA, USA
² Department of Mechanical Science and Engineering, Nagoya University, Japan

Comptes Rendus. Mécanique, Volume 339 (2011) no. 2-3, pp. 185-193.

Résumé
Abstract

Dans cet article, nous présentons l'implémentation de la méthode multipôle rapide (FMM) sur des calculateurs parallèles modernes, depuis les grappes parallèles aux processeurs multi-cœurs et aux cartes graphiques (GPU). La FMM est une application difficile à paralléliser à cause de la structure d'arbre et le fait qu'elle demande des opérations complexes qui ne sont pas structurées de façon régulière. L'algèbre linéaire computationnelle avec des matrices denses par exemple permet des optimisations qui utilisent les motifs réguliers de calcul. La FMM peut être optimisée de façon similaire mais nous verrons que la complexité de l'optimisation est supérieure. La discussion débute par une présentation générale de la FMM. On discutera brièvement des méthodes parallèles pour la FMM, comme la construction de l'arbre, et la réduction des communications durant le calcul. Finalement, nous présenterons plus en détail le développement et l'optimisation de la FMM sur GPUs.

In this article, we discuss how the fast multipole method (FMM) can be implemented on modern parallel computers, ranging from computer clusters to multicore processors and graphics cards (GPU). The FMM is a somewhat difficult application for parallel computing because of its tree structure and the fact that it requires many complex operations which are not regularly structured. Computational linear algebra with dense matrices for example allows many optimizations that leverage the regular computation pattern. FMM can be similarly optimized but we will see that the complexity of the optimization steps is greater. The discussion will start with a general presentation of FMMs. We briefly discuss parallel methods for the FMM, such as building the FMM tree in parallel, and reducing communication during the FMM procedure. Finally, we will focus on porting and optimizing the FMM on GPUs.

Publié le : 2011-01-11

DOI : 10.1016/j.crme.2010.12.005

Keywords: Computer science, Fast multipole method, Parallel computer
Mot clés : Informatique, Méthode multipôle rapide, Calculateur parallèle

Affiliations des auteurs :

Eric Darve ¹ ; Cris Cecka ¹ ; Toru Takahashi ²

¹ Mechanical Engineering Department, Institute for Computational and Mathematical Engineering, Stanford University, Durand 209, 496 Lomita Mall, 94305-3030 Stanford, CA, USA
² Department of Mechanical Science and Engineering, Nagoya University, Japan

@article{CRMECA_2011__339_2-3_185_0,
     author = {Eric Darve and Cris Cecka and Toru Takahashi},
     title = {The fast multipole method on parallel clusters, multicore processors, and graphics processing units},
     journal = {Comptes Rendus. M\'ecanique},
     pages = {185--193},
     publisher = {Elsevier},
     volume = {339},
     number = {2-3},
     year = {2011},
     doi = {10.1016/j.crme.2010.12.005},
     language = {en},
}

TY  - JOUR
AU  - Eric Darve
AU  - Cris Cecka
AU  - Toru Takahashi
TI  - The fast multipole method on parallel clusters, multicore processors, and graphics processing units
JO  - Comptes Rendus. Mécanique
PY  - 2011
SP  - 185
EP  - 193
VL  - 339
IS  - 2-3
PB  - Elsevier
DO  - 10.1016/j.crme.2010.12.005
LA  - en
ID  - CRMECA_2011__339_2-3_185_0
ER  -

%0 Journal Article
%A Eric Darve
%A Cris Cecka
%A Toru Takahashi
%T The fast multipole method on parallel clusters, multicore processors, and graphics processing units
%J Comptes Rendus. Mécanique
%D 2011
%P 185-193
%V 339
%N 2-3
%I Elsevier
%R 10.1016/j.crme.2010.12.005
%G en
%F CRMECA_2011__339_2-3_185_0

Eric Darve; Cris Cecka; Toru Takahashi. The fast multipole method on parallel clusters, multicore processors, and graphics processing units. Comptes Rendus. Mécanique, Volume 339 (2011) no. 2-3, pp. 185-193. doi : 10.1016/j.crme.2010.12.005. https://comptes-rendus.academie-sciences.fr/mecanique/articles/10.1016/j.crme.2010.12.005/

Bibliographie
Cité par

[1] W. Fong; E. Darve The black-box fast multipole method, Journal of Computational Physics, Volume 228 (2009) no. 23, pp. 8712-8725

[2] V. Rokhlin Rapid solution of integral equations of classical potential theory, Journal of Computational Physics, Volume 60 (1985) no. 2, pp. 187-207

[3] L. Greengard; V. Rokhlin A fast algorithm for particle simulations, Journal of Computational Physics, Volume 73 (1987) no. 2, pp. 325-348

[4] I. Lashuk; A. Chandramowlishwaran; H. Langston; T.A. Nguyen; R. Sampath; A. Shringarpure; R. Vuduc; L. Ying; D. Zorin; G. Biros A massively parallel adaptive fast-multipole method on heterogeneous architectures, Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ACM, 2009, pp. 1-12

[5] F.A. Cruz, M.G. Knepley, L.A. Barba, PetFMM – A dynamically load-balancing parallel fast multipole library.

[6] M.S. Warren; J.K. Salmon A parallel hashed oct-tree n-body algorithm, Proceedings of the 1993 ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis, ACM, 1993, p. 21

[7] G.M. Morton, A computer oriented geodetic data base and a new technique in file sequencing, International Business Machines Co., 1966.

[8] H. Sundar; R.S. Sampath; G. Biros Bottom-up construction and 2:1 balance refinement of linear octrees in parallel, SIAM Journal on Scientific Computing, Volume 30 (2008) no. 5, pp. 2675-2708

[9] K. Madduri; R. Vuduc Diagnosis, tuning, and redesign for multicore performance: A case study of the fast multipole method, Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ACM, 2010

[10] A. Chandramowlishwaran; S. Williams; L. Oliker; I. Lashuk; G. Biros; R. Vuduc Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), IEEE, 2010, pp. 1-12

[11] L. Ying; G. Biros; D. Zorin A kernel-independent adaptive fast multipole algorithm in two and three dimensions, Journal of Computational Physics, Volume 196 (2004) no. 2, pp. 591-626

[12] J. Kurzak; B.M. Pettitt Massively parallel implementation of a fast multipole method for distributed memory machines, Journal of Parallel and Distributed Computing, Volume 65 (2005) no. 7, p. 881

[13] S. Ogata; T.J. Campbell; R.K. Kalia; A. Nakano; P. Vashishta; S. Vemparala Scalable and portable implementation of the fast multipole method on parallel computers 1, Computer Physics Communications, Volume 153 (2003) no. 3, pp. 445-461

[14] G. Sylvand Performance of a parallel implementation of the FMM for electromagnetics applications, International Journal for Numerical Methods in Fluids, Volume 43 (2003) no. 8, pp. 865-879

[15] F. Wu; Y. Zhang; Z.Z. Oo; E. Li Parallel multilevel fast multipole method for solving large-scale problems, IEEE Antennas and Propagation Magazine, Volume 47 (2005) no. 4, p. 111

[16] NVIDIA Corporation, NVIDIA CUDA programming guide 3.0, online at www.nvidia.com, Feb 2010.

[17] C. Cecka, A.J. Lew, E. Darve, Assembly of finite element methods on graphics processors, Intl. J. Numerical Methods in Engineering (2009).

[18] C. Cecka; A. Lew; E. Darve Introduction to assembly of finite element methods on graphics processors, IOP Conference Series: Materials Science and Engineering, vol. 10, IOP Publishing, 2010, p. 012009

[19] R. Yokota; T. Hamada; J.P. Bardhan; M.G. Knepley; L.A. Barba Biomolecular electrostatics simulation by an FMM-based BEM on 512 GPUs, 2010 (Arxiv preprint) | arXiv

[20] T. Hamada; T. Narumi; R. Yokota; K. Yasuoka; K. Nitadori; M. Taiji 42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence, Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ACM, 2009, pp. 1-12

[21] R. Yokota; T. Narumi; R. Sakamaki; S. Kameoka; S. Obi; K. Yasuoka Fast multipole methods on a cluster of GPUs for the meshless simulation of turbulence, Computer Physics Communications, Volume 180 (2009) no. 11, pp. 2066-2078

[22] T.K. Sheel; R. Yokota; K. Yasuoka; S. Obi The study of colliding vortex rings using a special-purpose computer and FMM, Transactions of the Japan Society for Computational Engineering and Science, Volume 3 (2008), p. 13

[23] K. Xu; D.Z. Ding; Z.H. Fan; R.S. Chen Multilevel fast multipole algorithm enhanced by GPU parallel technique for electromagnetic scattering problems, Microwave and Optical Technology Letters, Volume 52 ( Mar 2010 ) no. 3, pp. 502-507

[24] M. Cwikla; J. Aronsson; V. Okhmatovski Low-frequency MLFMA on graphics processors, IEEE Antennas and Wireless Propagation Letters, Volume 9 (2010), pp. 8-11

[25] T. Takahashi, C. Cecka, E. Darve, An implementation of low-frequency fast multipole BIEM for Helmholtz' equation on GPU, in: Proceedings of JSME 23rd Computational Mechanics Conference (CD-ROM), Kitami, Hokkaido, Japan, JSME, September 2010.

Cité par Sources :

Commentaires - Politique