Comptes Rendus
Partial differential equations/Dynamical systems
On the convergence of formally diverging neural net-based classifiers
[Convergence de classifieurs par réseaux de neurones formellement divergents]
Comptes Rendus. Mathématique, Volume 356 (2018) no. 4, pp. 395-405.

In this note, we study the asymptotic behavior of gradient descent algorithms applied to classification problems based on elementary models of supervised-learning neural networks. We prove that these algorithms diverge in the strict mathematical sense, since the sequence of parameters defining the classifier is unbounded. However, by developing entropy–entropy production methods, our approach yields explicit rates showing, at least when the classes can be well separated, that the parameters diverge only logarithmically while the loss function converges to 0 polynomially. Consequently, from a practical standpoint, the algorithm does yield a classifier with good performance and may even appear to converge.

We present an analytical study of gradient descent algorithms applied to a classification problem in machine learning based on artificial neural networks. Our approach is based on entropy–entropy dissipation estimates that yield explicit rates. Specifically, as long as the neural nets remain within a set of “good classifiers”, we establish a striking feature of the algorithm: it mathematically diverges as the number of gradient descent iterations (“time”) goes to infinity but this divergence is only logarithmic, while the loss function vanishes polynomially. As a consequence, this algorithm still yields a classifier that exhibits good numerical performance and may even appear to converge.
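As a rough illustration of this phenomenon (our own sketch, not the authors' model or code), the snippet below runs plain gradient descent on a logistic-loss linear classifier over two well-separated classes. In this separable setting the loss is known to decay roughly like 1/t while the parameter norm grows only like log t, mirroring the "logarithmic divergence, polynomial convergence" behavior described in the abstract.

```python
# Minimal sketch (our illustration, not the authors' code): gradient descent on a
# logistic-loss linear classifier over linearly separable data. The weight norm
# grows only logarithmically in the iteration count while the loss decays
# polynomially, so the run formally "diverges" yet the classifier looks converged.
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian clouds, labels in {-1, +1}.
n = 200
X = np.vstack([rng.normal(+2.0, 0.3, size=(n, 2)),
               rng.normal(-2.0, 0.3, size=(n, 2))])
y = np.concatenate([np.ones(n), -np.ones(n)])

def loss_and_grad(w, X, y):
    """Average logistic loss and its gradient with respect to w."""
    margins = y * (X @ w)
    loss = np.mean(np.log1p(np.exp(-margins)))
    # d/dw log(1 + exp(-m)) = -sigmoid(-m) * y * x, with m = y * (x . w)
    coeff = -y / (1.0 + np.exp(margins))
    grad = (X * coeff[:, None]).mean(axis=0)
    return loss, grad

w = np.zeros(2)
lr = 0.5
for t in range(1, 100001):
    loss, grad = loss_and_grad(w, X, y)
    w -= lr * grad
    if t in (10, 100, 1000, 10000, 100000):
        print(f"t={t:6d}  loss={loss:.3e}  ||w||={np.linalg.norm(w):.3f}")

# Typical behavior: the loss keeps shrinking (roughly like 1/t) while ||w||
# creeps up slowly (roughly like log t) -- the iterates never actually converge.
```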

DOI : 10.1016/j.crma.2018.03.003
Leonid Berlyand 1 ; Pierre-Emmanuel Jabin 2

1 Department of Mathematics, The Pennsylvania State University, University Park, PA 16802, USA
2 CSCAMM and Department of Mathematics, University of Maryland, College Park, MD 20742, USA
@article{CRMATH_2018__356_4_395_0,
     author = {Leonid Berlyand and Pierre-Emmanuel Jabin},
     title = {On the convergence of formally diverging neural net-based classifiers},
     journal = {Comptes Rendus. Math\'ematique},
     pages = {395--405},
     publisher = {Elsevier},
     volume = {356},
     number = {4},
     year = {2018},
     doi = {10.1016/j.crma.2018.03.003},
     language = {en},
}
Leonid Berlyand; Pierre-Emmanuel Jabin. On the convergence of formally diverging neural net-based classifiers. Comptes Rendus. Mathématique, Volume 356 (2018) no. 4, pp. 395-405. doi : 10.1016/j.crma.2018.03.003. https://comptes-rendus.academie-sciences.fr/mathematique/articles/10.1016/j.crma.2018.03.003/



LB was partially supported by NSF DMREF grant DMS-1628411; PEJ was partially supported by NSF Grant DMS-161453, NSF Grant RNMS (Ki-Net) DMS-1107444 and by LTS grants DO 0048-0049-0050 and 0052.
