We present the statistical models most commonly used in survival data analysis. Parametric models are based on explicit distributions that depend only on unknown real parameters, while the preferred models are semi-parametric, like the Cox model, and involve unknown functions to be estimated. Now that big data sets are available, two types of methods are needed to deal with the resulting curse of dimensionality, including the non-informative factors that spoil the part of the data that is informative about the target. On the one hand, there are methods that reduce the dimension while maximizing the information left in the reduced data, after which classical stochastic models are applied. On the other hand, there are algorithms that apply directly to big data, i.e. artificial intelligence (AI or machine learning). These algorithms in fact have a probabilistic interpretation. We present several of the former methods. As for the latter, which comprise neural networks, support vector machines, random forests and more (see the second edition, January 2017, of Hastie, Tibshirani et al. (2005) [1]), we present the neural network approach. Neural networks are known to be efficient for prediction on big data. Having analyzed, with a classical stochastic model, risk factors for Alzheimer's disease on a data set of around 5000 patients and factors, we were interested in comparing its prediction performance with that of a neural network on this relatively small sample.
Catherine Huber-Carol; Shulamith Gross; Filia Vonta
Catherine Huber-Carol; Shulamith Gross; Filia Vonta. Risk analysis: Survival data analysis vs. machine learning. Application to Alzheimer prediction. Comptes Rendus. Mécanique, Volume 347 (2019) no. 11, pp. 817-830. doi: 10.1016/j.crme.2019.11.007. https://comptes-rendus.academie-sciences.fr/mecanique/articles/10.1016/j.crme.2019.11.007/
[1] The elements of statistical learning: data mining, inference and prediction, Math. Intell., Volume 27 (2005) no. 2, pp. 83-85
[2] Statistical Models and Methods for Lifetime Data, Vol. 362, John Wiley & Sons, 2011
[3] Statistical Models Based on Counting Processes, Springer Science & Business Media, 2012
[4] Testing Statistical Hypotheses, Springer Science & Business Media, 2006
[5] Goodness-of-Fit Tests and Model Validity, Springer Science & Business Media, 2012
[6] Estimation in a Cox regression model with a change-point according to a threshold in a covariate, Ann. Stat., Volume 31 (2003) no. 2, pp. 442-463
[7] Modeling Survival Data: Extending the Cox Model, Springer Science & Business Media, 2013
[8] Deep learning, Nature, Volume 521 (2015) no. 7553, p. 436
[9] Interval censored and truncated data: rate of convergence of NPMLE of the density, J. Stat. Plan. Inference, Volume 139 (2009) no. 5, pp. 1734-1749
[10] Matched pair experiments: Cox and maximum likelihood estimation, Scand. J. Stat. (1987), pp. 27-41
[11] Bootstrap methods for truncated and censored data, Stat. Sin. (1996), pp. 509-530
[12] The Bootstrap and Edgeworth Expansion, Springer Science & Business Media, 2013
[13] A chi-squared test for the generalized power Weibull family for the head-and-neck cancer censored data, J. Math. Sci., Volume 133 (2006) no. 3, pp. 1333-1341
[14] Logistic regression, survival analysis, and the Kaplan-Meier curve, J. Amer. Stat. Assoc., Volume 83 (1988) no. 402, pp. 414-425
[15] Exponentiated Weibull family for analyzing bathtub failure-rate data, IEEE Trans. Reliab., Volume 42 (1993) no. 2, pp. 299-302
[16] Remarques sur le maximum de vraisemblance, Qüestiió: Quaderns d'Estad. Investig. Oper., Volume 21 (1997) no. 1
[17] Regression models for truncated survival data, Scand. J. Stat. (1992), pp. 193-213
[18] Estimation of density for arbitrarily censored and truncated data, Probability, Statistics and Modelling in Public Health, Springer, 2006, pp. 246-265
[19] Efficient regression estimation under general censoring and truncation, Mathematical and Statistical Models and Methods in Reliability, Springer, 2010, pp. 235-241
[20] Regression models and life-tables, J. R. Stat. Soc., Ser. B, Methodol., Volume 34 (1972) no. 2, pp. 187-202
[21] Analysis of Survival Data, Chapman and Hall/CRC, 2018
[22] Effects of omitting covariates in Cox's model for survival data, Scand. J. Stat. (1988), pp. 125-138
[23] Efficient estimation in a non-proportional hazards model in survival analysis, Scand. J. Stat. (1996), pp. 49-61
[24] Frailty models for arbitrarily censored and truncated data, Lifetime Data Anal., Volume 10 (2004) no. 4, pp. 369-388
[25] Semiparametric transformation models for arbitrarily censored and truncated data, Parametric and Semiparametric Models With Applications to Reliability, Survival Analysis, Quality of Life, Springer, 2004, pp. 167-176
[26] Accelerated Life Models: Modeling and Statistical Analysis, Chapman and Hall/CRC, 2001
[27] Goodness-of-fit tests for accelerated life models, Goodness-of-Fit Tests and Model Validity, Springer, 2002, pp. 281-297
[28] Analyse statistique des durées de vie: modélisation des données censurées, Journées d'Étude en Statistique, vol. 3, Marseille-Luminy, 1988
[29] Nonparametric estimation and regression analysis with left-truncated and right-censored data, J. Amer. Stat. Assoc., Volume 91 (1996) no. 435, pp. 1166-1180
[30] Robust Statistics, Wiley, New York, 1981
[31] Robust versus nonparametric approaches and survival data analysis, Advances in Degradation Modeling, Springer, 2010, pp. 323-337
[32] Threshold regression for survival analysis: modeling event times by a stochastic process reaching a boundary, Stat. Sci., Volume 21 (2006) no. 4, pp. 501-513
[33] Threshold regression for survival data with time-varying covariates, Stat. Med., Volume 29 (2010) no. 7–8, pp. 896-905
[34] Analysis of the effect of occupational exposure to asbestos based on threshold regression modeling of case–control data, Biostatistics, Volume 15 (2013) no. 2, pp. 327-340
[35] First-hitting-time based threshold regression, International Encyclopedia of Statistical Science, 2011, pp. 523-524
[36] Reduced models for fluid–structure interaction problems, Int. J. Numer. Methods Eng., Volume 60 (2004) no. 1, pp. 139-152
[37] An overview of the proper generalized decomposition with applications in computational rheology, J. Non-Newton. Fluid Mech., Volume 166 (2011) no. 11, pp. 578-592
[38] A short review on model order reduction based on proper generalized decomposition, Arch. Comput. Methods Eng., Volume 18 (2011) no. 4, p. 395
[39] Low-rank tensor methods for model order reduction, Handbook of Uncertainty Quantification, 2017, pp. 857-882
[40] Estimation dans les tableaux de contingence à un grand nombre d'entrées, Int. Stat. Rev. (1974), pp. 193-203
[41] Within the sample comparison of prediction performance of models and submodels: application to Alzheimer's disease, Statistical Models and Methods for Reliability and Survival Analysis, 2013, pp. 95-109
[42] Y. Le Cun, Personal communication, December 2018.
[43] Reliability of Engineering Systems and Technological Risk, John Wiley & Sons, 2016
[44] Statistical Models and Methods for Biomedical and Technical Systems, Springer Science & Business Media, 2008
[45] Stochastic Risk Analysis and Management, John Wiley & Sons, 2017
[46] Machine Learning: A Probabilistic Perspective, MIT Press, 2012
[47] Analysis of Microarray Gene Expression Data, Springer Science & Business Media, 2007