
Biological modelling / Biomodélisation
Simulation-based estimation of stochastic process parameters in tumor growth
Comptes Rendus. Biologies, Volume 327 (2004) no. 3, pp. 181-192.

Abstract

Models are generally developed at the micro level. Data are generally gathered at the macro level. Obtaining the macromodel which is the natural consequence of the underlying micro model is generally not feasible. SIMEST gives a means whereby the micromodel is used to generate, for a given assumed set of parameters, simulated sets of macro data. These data are compared with the actual clinical macro data. The parameters are then adjusted to obtain concordance with the clinical data. In this manner, simulation gives us a means of parameter estimation without the necessity of generating the macro model.

Metadata
DOI: 10.1016/j.crvi.2003.06.001
Keywords: SIMEST, simulation, tumor growth, pseudo-data

James R. Thompson 1

1 Department of Statistics, Rice University, 6100 South Main Street, Houston, TX 77001-1892, USA
James R. Thompson. Simulation-based estimation of stochastic process parameters in tumor growth. Comptes Rendus. Biologies, Volume 327 (2004) no. 3, pp. 181-192. doi : 10.1016/j.crvi.2003.06.001. https://comptes-rendus.academie-sciences.fr/biologies/articles/10.1016/j.crvi.2003.06.001/


1 The SIMEST paradigm

Following the argument developed in [1], we shall be creating pseudo-datasets from a micromodel for a given set of assumed input parameters. As we note from Fig. 1, these pseudo-data will be compared with the actual clinical macro data, and the assumed parameters will be adjusted to bring the simulated pseudo-data into concordance with the clinical data.

Fig. 1

The SIMEST paradigm.

2 Poisson process modeling

In 1837, well before the plethora of technological processes suited to his modeling strategy existed, Poisson [2] proposed the following model for P_k(t) = Pr[k events in [0,t)]. Everything flows from the following four axioms:

  • (1) Pr[1 event in [t,t+h)] = λh + o(h);
  • (2) Pr[2 or more events in [t,t+h)] = o(h);
  • (3) Pr[j events in [t_1,s_1) and k events in [t_2,s_2)] = Pr[j events in [t_1,s_1)] Pr[k events in [t_2,s_2)] if [t_1,s_1) ∩ [t_2,s_2) = ∅;
  • (4) λ is constant over time.
Then
P_k(t+h) = P_k(t)P_0(h) + P_{k−1}(t)P_1(h) = P_k(t)[1 − λh + o(h)] + P_{k−1}(t)[λh + o(h)]
so
P_k(t+h) − P_k(t) = λh[P_{k−1}(t) − P_k(t)] + o(h)   (1)
Dividing by h and letting h→0, we have
dP_k(t)/dt = λ[P_{k−1}(t) − P_k(t)]   (2)
Simple substitution in (2) verifies Poisson's solution:
P_k(t) = e^{−λt}(λt)^k/k!   (3)
The mean and variance of k are easily shown both to be equal to λt.

Let us consider an early application of Poisson's model. The German statistician von Bortkiewicz examined the numbers of suicides of women in eight German states over 14 years. His results are shown in Table 1 [3]. It is an interesting question whether it can plausibly be claimed that the suicide data follow Poisson's model. If we compute the sample mean of the number of suicides per state per year, we find that it is 3.473. We can then use this value as an estimate for λt. In Table 1, we also show the expected numbers of suicides under the Poisson model.

Table 1

Actual and expected numbers of suicides per year

Suicides    0     1     2     3     4     5     6     7     8     9     10    Sum
Freq.       9     19    17    20    15    11    8     2     3     5     3     112
E (Freq.)   3.5   12.1  21    24.3  21    14.6  8.5   4.2   1.9   0.7   0.2   112

One of the oldest statistical tests is Karl Pearson's goodness of fit. When data are naturally categorized, as they are here, into k bins (by the number of suicides per state per year), if the number observed in bin i is X_i and the expected number, according to a model, is E_i, then

∑_{i=1}^{k} (X_i − E_i)^2 / E_i ∼ χ^2(k−1)   (4)
For the von Bortkiewicz data, we compute a χ^2 value of 54.9. This is well beyond the χ^2_{0.990}(10) value of 23.21, so we might reject the applicability of the Poisson model. On the other hand, the Pearson approximation is asymptotic, and we require a minimum expected count of 5 for each E_i. In the present example, that means we have to pool the first two bins and the last four, which gives the revised Table 2.

Table 2

Actual and expected numbers of suicides per year

Suicides    0–1   2     3     4     5     6     ⩾7    Sum
Freq.       28    17    20    15    11    8     13    112
E (Freq.)   15.6  21    24.3  21    14.6  8.5   7     112

This gives us a χ^2 value of 19.15, which is above the χ^2_{0.990}(6) value of 16.81 but below the χ^2_{0.998}(6) value of 20.79. Depending upon the use we intend to make of the Poisson model, we might choose to accept it. Yet the relatively small sample involved might make us wish to try other approaches. For example, we know we have the totals of suicides per year given in Table 1. We might decide to employ the following strategy (a code sketch follows the algorithm).

Algorithm

Resampled data compared with model-generated data.

  • 1. Create an ‘urn’ with nine 0 balls, nineteen 1 balls, seventeen 2 balls, and so on.
  • 2. With replacement, sample from the urn 1000 samples of size 112, noting the results.
  • 3. For each of the 1000 samples, compute the χ2 statistic in (4) using the original values in Table 1 for the Ei.
  • 4. Using the estimate for λt of 3.473, divide the line segment from zero to 1 according to the Poisson model. Thus the probability of finding a state with zero suicides in a year is exp(−λt)=0.031. The 0 Poisson bin then is [0,0.031). The probability of finding a state with one suicide in a year is exp(−λt)λt/1!=0.108. So the 1 Poisson bin is [0.031,0.031+0.108), and so on.
  • 5. Repeat, 1000 times, 112 draws of a uniform [0,1] random variable, binning each set of 112 draws according to the Poisson bins of step 4.
  • 6. Using the E_i values from the third row of Table 1, compute the χ^2 statistic in (4) for each of these 1000 binned samples.
  • 7. Compute histograms of the χ^2 values for both the resampling simulation and the Poisson-model simulation. If the overlap is at least, say, 5%, accept the hypothesis that the Poisson model fits the data.
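A minimal sketch of this procedure in Python (not the author's original code) is given below; the 'overlap' of step 7 is interpreted here as the overlapping area of the two normalized histograms, which is one reasonable reading of that step.

import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)

values = np.arange(11)                                      # 0..10 suicides per state-year
counts = np.array([9, 19, 17, 20, 15, 11, 8, 2, 3, 5, 3])   # Table 1 frequencies
n = counts.sum()                                            # 112 state-years
lam_t = (values * counts).sum() / n                         # estimate of lambda*t (about 3.473)

# Poisson bin probabilities (step 4); the small tail probability beyond 10 is
# lumped into the last bin so that the probabilities sum to one.
probs = np.array([exp(-lam_t) * lam_t ** k / factorial(k) for k in values])
probs[-1] += 1.0 - probs.sum()
expected = n * probs                                        # third row of Table 1

def chi2(observed, expected):
    return ((observed - expected) ** 2 / expected).sum()

B = 1000
chi2_resample, chi2_poisson = np.empty(B), np.empty(B)
urn = np.repeat(values, counts)                             # step 1: the 'urn'
cum = np.cumsum(probs)
for b in range(B):
    sample = rng.choice(urn, size=n, replace=True)          # steps 2-3
    chi2_resample[b] = chi2(np.bincount(sample, minlength=11), expected)
    u = rng.random(n)                                       # steps 5-6
    obs_p = np.bincount(np.searchsorted(cum, u), minlength=11)[:11]
    chi2_poisson[b] = chi2(obs_p, expected)

# step 7: overlapping area of the two normalized histograms
edges = np.linspace(0.0, max(chi2_resample.max(), chi2_poisson.max()), 26)
h_r, _ = np.histogram(chi2_resample, edges, density=True)
h_p, _ = np.histogram(chi2_poisson, edges, density=True)
overlap = (np.minimum(h_r, h_p) * np.diff(edges)).sum()
print(f"lambda*t = {lam_t:.3f}, histogram overlap = {overlap:.2f}")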

Uses of the Poissonian framework are seen, very frequently, in the simulation of a train of time-indexed events. Now,

1 − F(t) = P(0 events in [0,t)) = exp[−λt]   (5)
But F(t), the probability that an event occurs on or before t, is a continuous cumulative distribution function, and F evaluated at the random event time is distributed as a uniform variate on [0,1]; so, also, is 1 − F. Thus, it is an easy matter, starting at time zero, to simulate the time of the next event. We generate u from U[0,1]. Then the time of the next simulated event is given by
t = −(1/λ) log(u)   (6)
Thus, it is possible to create a series of n simulated events by simply generating u_1, u_2, …, u_n and then using
t_i = t_{i−1} − (1/λ) log(u_i)   (7)
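As a brief illustration (a sketch, not taken from the paper), Eq. (7) translates directly into a few lines of Python:

import numpy as np

def poisson_event_times(lam: float, n_events: int, rng=None) -> np.ndarray:
    """Simulate n_events successive event times of a rate-lam Poisson process, Eq. (7)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(n_events)
    return np.cumsum(-np.log(u) / lam)    # t_i = t_{i-1} - (1/lam) log(u_i)

print(poisson_event_times(2.0, 5, np.random.default_rng(1)))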

Next, let us consider what might be done if we relax the axiom that states that λ must be constant. We can easily do this for the special case where we consider P_0(t), that is, the probability that no events happen in the time interval [0,t):

P_0(t+h) − P_0(t) = −λ(t)hP_0(t) + o(h)   (8)
Dividing by h and taking the limit as h goes to zero, we have
[1/P_0(t)] dP_0(t)/dt = −λ(t)   (9)
Integrating from 0 to t, we have
P_0(t) = exp[−∫_0^t λ(τ) dτ]   (10)
We are now able to carry out simulations in rather complicated situations. Let us suppose for example, that a tumor, starting with one cell, grows exponentially according to:
v(t) = c e^{αt}, where c is the volume of one cell   (11)
Next, let us suppose that this tumor throws off metastases at a rate proportional to its volume, with proportionality constant a. Then the probability that a metastasis will have been produced on or before time t_M is given by
F_M(t_M) = 1 − exp[−(ac/α)(e^{αt_M} − 1)]   (12)
From (12) we can easily write a simulation for the origination times of metastases starting from a tumor with given values of c, α, and a.
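A sketch of such a simulation, assuming the form of Eq. (12) with the e^{αt_M} − 1 factor and using function and parameter names chosen for illustration:

import numpy as np

def first_metastasis_times(alpha: float, a: float, c: float, n: int, rng=None) -> np.ndarray:
    """Draw n independent times of the first metastasis by inverting Eq. (12):
    F_M(t) = 1 - exp[-(a c / alpha)(e^{alpha t} - 1)], using -log(u) ~ Exp(1)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(n)
    return np.log(1.0 + (alpha / (a * c)) * (-np.log(u))) / alpha

# e.g., time in months, volume in cells (c = 1 cell)
print(first_metastasis_times(alpha=0.31, a=0.17e-9, c=1.0, n=5, rng=np.random.default_rng(2)))

Successive metastasis origination times follow the same recipe, with the integral in (10) restarted from the previous origination time; this is done in the fuller sketch given for the oncological example below.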

3 SIMEST: an oncological example

The power of the computer as an aid to modeling does not get the attention it deserves. Part of the reason is that the human modeling approach tends to be analog rather than digital. Analog computers were replaced by digital computers 40 years ago. Most statisticians remain fascinated by the graphical capabilities of the digital computer. The exploratory data analysis route tends to attempt to replace modeling by visual displays which are then interpreted, in a more-or-less instinctive fashion, by an observer. Statisticians who proceed in this way are functioning somewhat like prototypical cyborgs. After over two decades of seeing data spun, colored, and graphed in a myriad of ways, I have to admit to being disappointed when comparing the promise of EDA with its reality. Its influence amongst academic statisticians has been enormous. Visualization is clearly one of the major areas in the statistical literature. But the inferences drawn from these visualizations in the real world are, relatively speaking, not so numerous. Moreover, when visualization-based inferences are drawn, they tend to give results one might have obtained by classical techniques.

Of course, as in the case of using the computer as a nonparametric smoother, some uses are better than others. It is extremely unfortunate that some are so multicultural in their outlook that they rearrange their research agenda in order to accommodate themselves to our analog-challenged friends, the digital computers. Perhaps the greatest disappointment is to see the modeling aspect of our analog friends, the human beings, being disregarded in favor of using them as gestaltic image processors. This really will not do. We need to rearrange the agenda so that the human beings can gain the maximal assistance from the computers in making inferences from data. That is the purpose of SIMEST.

There is an old adage to the effect that quantitative change carried far enough may produce qualitative change. The fact is that we now have computers so fast and cheap that we can proceed (almost) as though computation were free and instantaneous (with infinite accessible memory thrown in as well). This should change, fundamentally, the way we approach data analysis in the light of models.

There are now a number of examples in several fields where SIMEST has been used to obtain estimates of the parameters characterizing a market-related applied stochastic process. Below we consider an oncological application to motivate and to explicate SIMEST. We shall first show a traditional model-based data analysis, note the serious (generally insurmountable) difficulties involved, and then give a simulation-based, highly computer-intensive way to get what we require to understand the process and act upon that understanding.

3.1 An exploratory prelude

In the late 1970s, my colleague Barry W. Brown, of the University of Texas M.D. Anderson Cancer Center, and I had started to investigate some conjectures concerning reasons for the relatively poor performance of oncology in the American ‘War on Cancer’. Huge amounts of resources had been spent with less encouraging results than one might have hoped. It was my view that part of the reason might be that the basic orthodoxy for cancer progression was, somehow, flawed.

This basic orthodoxy can be summarized briefly as follows.

At some time, for some reason, a single cell goes wild. It, and its progeny, multiply at rates greater than that required for replacement. The tumor thus formed grows more or less exponentially. From time to time, a cell may break off (metastasize) from the tumor and start up a new tumor at some distance from the primary (original) tumor. The objective of treatment is to find and excise the primary before it has had a chance to form metastases. If this is done, then the surgeon (or radiologist) will have “gotten it all” and the patient is cured. If metastases are formed before the primary is removed, then a cure is unlikely, but the life of the patient may be extended and ameliorated by aggressive administration of chemotherapeutic agents which will kill tumor cells more vigorously than normal cells. Unfortunately, since the agents do attack normal cells as well, a cure of metastasized cancer is unlikely, since the patient's body cannot sustain the dosage required to kill all the cancer cells.

For some cancers, breast cancer, for example, long-term cure rates had not improved very much for many years.

3.2 Model and algorithms

One conjecture, consistent with a roughly constant intensity of display of secondary tumors, is that a patient with a tumor of a particular type is displaying not only breakaway colonies, but also new primary tumors, due to suppression of the patient's immune system's ability to attack tumors of that particular type. We can formulate axioms at the micro level which incorporate the mechanism of new primaries.

Such an axiomatization has been formulated by Bartoszyński et al. [4]. The first five hypotheses are consistent with the classical view of metastatic progression. Hypothesis 6 is the mechanism we introduce to explain the nonincreasing intensity function of secondary tumor display.

Hypothesis 1

For any patient, each tumor originates from a single cell and grows at exponential rate α.

Hypothesis 2

The probability that the primary tumor will be detected and removed in [t,t+Δt) is given by bY_0(t)Δt + o(Δt), and, until the removal of the primary, the probability of a metastasis in [t,t+Δt) is aY_0(t)Δt + o(Δt), where Y_0(t) is the size of the primary tumor at time t.

Hypothesis 3

For patients with no discovery of secondary tumors during the time of observation, S, put m_1(t) = Y_1(t) + Y_2(t) + ⋯, where Y_i(t) is the size of the ith originating tumor. After removal of the primary, the probability of a metastasis in [t,t+Δt) equals am_1(t)Δt + o(Δt), and the probability of detection of a new tumor in [t,t+Δt) is bm_1(t)Δt + o(Δt).

Hypothesis 4

For patients who do display a secondary tumor, after removal of the primary and before removal of Y_1, the probability of detection of a tumor in [t,t+Δt) equals bY_1(t)Δt + o(Δt), while the probability of a metastasis is aY_1(t)Δt + o(Δt).

Hypothesis 5

For patients who do display a secondary tumor, the probability of a metastasis in [t,t+Δt) is am_2(t)Δt + o(Δt), while the probability of detection of a tumor is bm_2(t)Δt + o(Δt), where m_2(t) = Y_2(t) + ⋯.

Hypothesis 6

The probability of a systemic occurrence of a tumor in [t,t+Δt) equals λΔt + o(Δt), independent of the prior history of the patient.

Essentially, we shall attempt to develop the likelihood function for this model so that we can find the values of a, b, α, and λ which maximize the likelihood of the data set observed. It turns out that this is a formidable task indeed. The SIMEST algorithm which we develop later gives a quick alternative to finding the likelihood function. However, to give the reader some feel as to the complexity associated with model aggregation from seemingly innocent axioms, we shall give some of the details of getting the likelihood function. First of all, it turns out that in order to have any hope of obtaining a reasonable approximation to the likelihood function, we will have to make some further simplifying assumptions. We shall refer to the period prior to detection of the primary as Phase 0. Phase 1 is the period from detection of the primary to S′, the first time of detection of a secondary tumor. For those patients without a secondary tumor, Phase 1 is the time of observation, S. Phase 2 is the time, if any, between S′ and S. Now for the two simplifying axioms. T0 is defined to be the (unobservable) time between the origination of the primary and the time when it is detected and removed (at time t=0). T1 and T2 are the times until detection and removal of the first and second of the subsequent tumors (times to be counted from t=0). We shall let X be the total mass of all tumors other than the primary at t=0.

Hypothesis 7

For patients who do not display a secondary tumor, growth of the primary tumor, and of all tumors in Phase 1, is deterministically exponential with the growth of all other tumors treated as a pure birth process.

Hypothesis 8

For patients who display a secondary tumor, the growth of the following tumors is treated as deterministic: in Phase 0, tumors Y0(t) and Y1(t); in Phase 1, tumor Y1(t) and all tumors which originated in Phase 0; in Phase 2, all tumors. The growth of remaining tumors in Phases 0 and 1 is treated as a pure birth process.

We now define

H(s;t,z) = exp{(az/α) e^{αt}(e^{s} − 1) log[1 + (e^{−αt} − 1)e^{−s}] + (λ/α)s − (λ/α) log[1 + e^{αt}(e^{s} − 1)]}   (13)
and
p(t;z) = bz e^{αt} exp[−(bz/α)(e^{αt} − 1)]   (14)
Further, we shall define
w(y) = λ∫_0^{y} e^{−ν(u)} du − y   (15)
where ν(u) is determined from
u = ∫_0^{ν} [a + b + αs − a e^{−s}]^{−1} ds   (16)

Then, we can establish the following propositions, and from these, the likelihood function:

P(T_0 > τ) = exp[−b ∫_0^{τ} e^{αt} dt] = exp[−(b/α)(e^{ατ} − 1)]   (17)
For patients who do not display a secondary tumor, we have
P(T_1 > S | X = x) = exp[−xν(S) + w(S)]   (18)
For patients who develop metastases, we have
P(T_1 > S) = P(no secondary tumor in (0,S)) = ∫_0^{∞} e^{w(S)} p(t;1) H(ν(S); t, 1) dt   (19)
Similarly, for patients who do display a secondary tumor, we have
P(T_1 = S′, T_2 > S) = ∫_0^{∞} ∫_0^{t} e^{w(S−S′)} p(t;1) p(S′; e^{αu}) [λ + a e^{α(t−u)}] exp[−λ(t−u) − (a/α)(e^{α(t−u)} − 1)] H(ν(S−S′); S′, e^{αu}) H(ν(S−S′)e^{αS′}; u, e^{α(t−u)}) du dt + ∫_0^{∞} ∫_0^{S′} e^{w(S−S′)} p(t;1) exp[−λt − (a/α)(e^{αt} − 1)] λe^{−λu} p(S′−u; 1) H(ν(S−S′); S′−u, 1) du dt   (20)

Finding the likelihood function, even a quadrature approximation to it, is more than difficult. Furthermore, current symbol manipulation programs (e.g., Mathematica, Maple) do not have the capability of doing the work. Accordingly, it must be done by hand. Approximately 1.5 person years were required to obtain a quadrature approximation to the likelihood. Before starting this activity, we had no idea of the rather practical difficulties involved. However, the activity was not without reward.

We found estimates for the parameter values using a data set consisting of 116 women who presented with primary breast cancer at the Curie-Sklodowska Cancer Institute in Warsaw (time units in months, volume units in cells): a = 0.17×10^{−9}, b = 0.23×10^{−8}, α = 0.31, and λ = 0.0030. Using these parameter values, we found excellent agreement between the proportion free of metastasis versus time obtained from the data and that obtained from the model. When we tried to fit the model to the data with the constraint that λ = 0 (that is, disregarding the systemic process, as is generally done in oncology), the attempt failed.

One thing one always expects from a model-based approach is that, once the relevant parameters have been estimated, many things one had not planned to look for can be found. For example, the implied tumor doubling time is 2.2 months. The median time from primary origination to detection is 59.2 months, at which time the tumor consists of 9.3×10^7 cells. The probability of metastasis prior to detection of the primary is 0.069, and so on. A model-based approach generally yields such serendipitous results, whereas a nonparametric approach generally does not. It is worth mentioning that, more frequently than one realizes, we need an analysis which is flexible, in the event that at some future time we need to answer questions different from those originally posed. The quadrature approximation of the likelihood is relatively inflexible compared to the simulation-based approach we shall develop shortly.

As for the relative importance of the systemic and metastatic mechanisms in causing secondary tumors associated with breast cancer, it would appear from Fig. 2 that the systemic one is the more important. This result is surprising, but it is consistent with what we have seen in our exploratory analysis of another tumor system (melanoma). Interestingly, it is by no means true that the systemic term has such dominance for all tumor systems. For primary lung cancer, for example, the metastatic term appears to be far more important.

Fig. 2

Metastatic and systemic effects. ○ log[prob(metastatic tumor originating by time t)]; □ log[prob(systemic tumor originating by time t)].

It is not clear how to postulate, in any definitive fashion, a procedure for testing the null hypothesis of the existence of a systemic mechanism in the progression of cancer. We have already noted that when we suppress the systemic hypothesis, we cannot obtain even a poor maximum likelihood fit to the data. However, someone might argue that a different set of nonsystemic axioms should have been proposed. Obviously, we cannot state that it is simply impossible to manage a good fit without the systemic hypothesis. However, it is true that the nonsystemic axioms we have proposed are a fair statement of traditional suppositions as to the growth and spread of cancer.

As a practical matter, we had to use data that were oriented toward the life of the patient rather than toward the life of a tumor system. This is due to the fact that human in vivo cancer data is seldom collected with an idea toward modeling tumor systems. For a number of reasons, including the difficulty mentioned in obtaining the likelihood function, deep stochastic modeling has not traditionally been employed by many investigators in oncology. Modeling frequently precedes the collection of the kinds of data of greatest use in the estimation of the parameters of the model. Anyone who has gone through a modeling exercise such as that covered in this section is very likely to treat such an exercise as a once in a lifetime experience. It simply is too frustrating to have to go through all the flailing around to come up with a quadrature approximation to the likelihood function. As soon as a supposed likelihood function has been found, and a corresponding parameter estimation algorithm constructed, the investigator begins a rather lengthy ‘debugging’ experience. The algorithm's failure to work might be due to any number of reasons (e.g., a poor approximation to the likelihood function, a poor quadrature routine, a mistake in the code of the algorithm, inappropriateness of the model, etc.). Typically, the debugging process is time consuming and difficult. If one is to have any hope for coming up with a successful model-based investigation, an alternative to the likelihood procedure for aggregation must be found.

In order to decide how best to construct an algorithm for parameter estimation which does not have the difficulties associated with the classical closed-form approach, we should try to see just what causes the difficulty with the classical method of aggregating from the microaxioms to the macro level, where the data lives. A glance at Fig. 3 reveals the problem with the closed-form approach.

Fig. 3

Two possible paths from primary to secondary.

The axioms of tumor growth and spread are easy enough to implement in the forward direction. Indeed, they follow the natural forward formulation used since Poisson's work of 1837. Essentially, we are overlaying stochastic processes, one on top of the other, and interdependently to boot. But when we go through the task of finding the likelihood, we are essentially seeking all possible paths by which the observables could have been generated.

The secondary tumor originating at time t_3 could have been thrown off directly from the primary at time t_3, or it could have been thrown off from a tumor which was itself thrown off at time t_2 from another tumor, which was in turn thrown off at time t_1 from the primary, which originated at time t_0. The number of possibilities is, of course, infinite.

In other words, the problem with the classical likelihood approach in the present context is that it is a backward look from a database generated in the forward direction. To scientists before the present generation of fast, cheap computers, the backward approach was, essentially, unavoidable unless one avoided such problems (a popular way out of the dilemma). However, we need not be so restricted.

Once we realize that the difficulty comes from using a backward approach with a concatenation of forwardly axiomatized mechanisms, the way out of our difficulty is rather clear. We need to analyze the data using a forward formulation. The most obvious way to carry this out is to pick a guess for the underlying vector of parameters, put this guess into the micro-axiomatized model, and simulate many times of appearance of secondary tumors. Then, we can compare the set of simulated quasidata with the actual data.

The greater the concordance, the better we will believe we have done in our guess for the underlying parameters. If we can quantify this measure of concordance, then we will have a means of guiding our next guess. One such way to carry this out would be to order the secondary occurrences in the data set from smallest to largest and divide them into k bins, each containing the same proportion of the data. Then, we could note the proportion of quasidata points falling in each of the bins.

If the proportions observed for the quasidata, corresponding to parameter value Θ, were denoted by {π_j(Θ)}_{j=1}^{k}, then a Pearson goodness-of-fit statistic would be given by:

χ^2(Θ) = ∑_{j=1}^{k} [π_j(Θ) − 1/k]^2 / π_j(Θ)   (21)
The minimization of χ2(Θ) provides us with a means of estimating Θ.
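The outer search might be organized as in the following sketch, which assumes a hypothetical forward simulator simulate_times(theta, N, rng) returning N quasidata times (one such simulator for the oncological model is sketched later); the Nelder–Mead search, the value of k, and the bin handling are illustrative choices rather than prescriptions from the paper.

import numpy as np
from scipy.optimize import minimize

def chi2_criterion(theta, data, simulate_times, k=20, N=10_000, seed=0):
    """Eq. (21): equal-proportion bins built from the data, quasidata proportions pi_j."""
    rng = np.random.default_rng(seed)                 # common random numbers across calls
    edges = np.quantile(data, np.linspace(0.0, 1.0, k + 1))
    sim = np.asarray(simulate_times(theta, N, rng))   # hypothetical forward simulator
    sim = sim[np.isfinite(sim)]
    # assign each simulated time to a bin; values outside the data range go to the end bins
    idx = np.clip(np.searchsorted(edges, sim, side="right") - 1, 0, k - 1)
    pi = np.bincount(idx, minlength=k) / N
    pi = np.maximum(pi, 1.0 / N)                      # guard against empty bins
    return float(np.sum((pi - 1.0 / k) ** 2 / pi))

def simest_fit(theta0, data, simulate_times):
    """Minimize chi2(Theta) with a derivative-free search over the parameter vector."""
    return minimize(lambda th: chi2_criterion(th, data, simulate_times),
                    theta0, method="Nelder-Mead").x

Fixing the seed inside the criterion gives common random numbers across parameter guesses, so the optimizer sees a criterion surface that does not re-randomize at every evaluation.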

Typically, the sample size n of the data will be much less than N, the size of the simulated quasidata. Under mild regularity conditions, assuming that the likelihood function (which we of course do not know) has a single local maximum Θ0, then as N→∞ and as n becomes large, with k increasing in such a way that lim_{n→∞} k = ∞ and lim_{n→∞} k/n = 0, the minimum χ^2 estimator for Θ0 will have an expected mean square error which approaches that of the maximum likelihood estimator. This is, obviously, quite a bonus: essentially, we are able to forfeit knowledge of the likelihood function and still obtain an estimator with asymptotic efficiency equal to that of the maximum likelihood estimator. The price to be paid is the acquisition of a computer swift enough and cheap enough to carry out a very great number, N, of simulations, say 10 000. This ability to use the computer to get us out of the 'backward trap' is a potent but, as yet, seldom used bonus of the computer age. Currently, the author is using SIMEST on a 400-MHz personal computer, amply adequate for the task, which now costs around $1000.

First, we observe how the forward approach enables us to eliminate those hypotheses which were, essentially, a practical necessity if a likelihood function was to be obtained. Our new axioms are simply:

Hypothesis 1

For any patient, each tumor originates from a single cell and grows at exponential rate α.

Hypothesis 2

The probability that the primary tumor will be detected and removed in [t,t+Δt) is given by bY_0(t)Δt + o(Δt). The probability that a tumor of size Y(t) will be detected in [t,t+Δt) is given by bY(t)Δt + o(Δt).

Hypothesis 3

The probability of a metastasis in [t,t+Δt) is aΔt × (total tumor mass present).

Hypothesis 4

The probability of a systemic occurrence of a tumor in [t,t+Δt) equals λΔt + o(Δt), independent of the prior history of the patient.

In order to simulate, for a given value of (α,a,b,λ), a quasidata set of secondary tumors, we must first define:

  • tD=time of detection of primary tumor;
  • tM=time of origin of first metastasis;
  • tS=time of origin of first systemic tumor;
  • tR=time of origin of first recurrent tumor;
  • td=time from tR to detection of first recurrent tumor;
  • tDR=time from tD to detection of first recurrent tumor.

Now, generating a random number u from the uniform distribution on the unit interval, if F(·) is the appropriate cumulative distribution function for a time t, we set t = F^{−1}(u). Then, assuming that the tumor volume at time t is:

v(t) = c e^{αt}, where c is the volume of one cell   (22)
we have
F_M(t_M) = 1 − exp[−(ac/α)(e^{αt_M} − 1)]   (23)
Similarly, we have
F_D(t_D) = 1 − exp[−∫_0^{t_D} bc e^{ατ} dτ] = 1 − exp[−(bc/α)(e^{αt_D} − 1)],   (24)
F_S(t_S) = 1 − e^{−λt_S}   (25)
and
F_d(t_d) = 1 − exp[−(bc/α)(e^{αt_d} − 1)]   (26)
Using the actual times of discovery of secondary tumors, t_1 ⩽ t_2 ⩽ ⋯ ⩽ t_n, we generate k bins. In actual tumor situations, because of recording protocols, we may not be able to put the same number of secondary tumors in each bin. Let us suppose that the observed proportions are given by (p_1, p_2, …, p_k). We shall generate N recurrences s_1 < s_2 < ⋯ < s_N. The proportions of the simulated recurrences falling in each of the bins will be denoted π_1, π_2, …, π_k. The goodness of fit corresponding to (α,λ,a,b) will be given by:
χ^2(α,λ,a,b) = ∑_{j=1}^{k} [π_j(α,λ,a,b) − p_j]^2 / π_j(α,λ,a,b)   (27)
As a practical matter, we may replace π_j(α,λ,a,b) in the denominator by p_j, since, with (α,λ,a,b) far away from the truth, π_j(α,λ,a,b) may well be zero. A small helper for this criterion is sketched below; the algorithm that follows it then generates the times of detection of quasisecondary tumors for a particular parameter value (α,λ,a,b), and a Python sketch of that algorithm is given after it.
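As a sketch, the criterion with this substitution is a one-line computation once the binned proportions are in hand:

import numpy as np

def chi2_fit(pi, p):
    """Eq. (27), with the observed proportions p_j used in the denominator so that
    bins left empty by a poor parameter guess cannot cause a division by zero."""
    pi, p = np.asarray(pi, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum((pi - p) ** 2 / p))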

Algorithm

Secondary tumor simulation (α,λ,a,b).

  • Generate tD
  • j=0
  • i=0
  • Repeat until tM(j)>tD
  • j=j+1
  • Generate tM(j)
  • Generate tdM(j)
  • tdM(j)←tdM(j)+tM(j)
  • If tdM(j)<tD, then tdM(j)←∞
  • Repeat until tS>10tD
  • i=i+1
  • Generate tS(i)
  • Generate tdS(i)
  • tdS(i)←tdS(i)+tS(i)
  • s←min[tdM(j),tdS(i)]
  • Return s
  • End repeat
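The following self-contained Python sketch is one rendering of this algorithm, not the author's code. The inverse-CDF draws come from (23)–(26); successive metastasis origination times are generated as arrivals of a nonhomogeneous Poisson process of intensity ace^{αt}, which is one reading of the unspecified 'Generate tM(j)' step; time runs from the origination of the primary and c is the volume of one cell.

import numpy as np

def draw_detection_delay(alpha, b, c, rng):
    """Invert Eq. (26): time from a tumor's origination (one cell) to its detection."""
    return np.log(1.0 + (alpha / (b * c)) * rng.exponential()) / alpha

def next_metastasis(alpha, a, c, t_prev, rng):
    """Next metastasis origination time after t_prev for intensity a*c*exp(alpha*t)."""
    return np.log(np.exp(alpha * t_prev) + (alpha / (a * c)) * rng.exponential()) / alpha

def secondary_detection_time(alpha, lam, a, b, c=1.0, rng=None):
    """One quasidata point: the detection time of the first secondary tumor
    (np.inf if none is detected within the 10*tD horizon of the algorithm)."""
    rng = np.random.default_rng() if rng is None else rng
    tD = draw_detection_delay(alpha, b, c, rng)           # detection/removal of the primary
    detections = []

    # Metastases thrown off by the primary before its removal at tD.
    tM = next_metastasis(alpha, a, c, 0.0, rng)
    while tM <= tD:
        td = tM + draw_detection_delay(alpha, b, c, rng)  # absolute detection time
        if td >= tD:                                      # primary assumed detected first
            detections.append(td)
        tM = next_metastasis(alpha, a, c, tM, rng)

    # Systemic occurrences: a rate-lam Poisson process followed out to 10*tD.
    tS = rng.exponential(1.0 / lam)
    while tS <= 10.0 * tD:
        detections.append(tS + draw_detection_delay(alpha, b, c, rng))
        tS += rng.exponential(1.0 / lam)

    return min(detections) if detections else np.inf

# Parameter values of the order reported for the breast-cancer data (months, cells).
rng = np.random.default_rng(3)
print([round(secondary_detection_time(0.31, 0.003, 0.17e-9, 0.23e-8, rng=rng), 1)
       for _ in range(5)])

Repeating secondary_detection_time N times for a trial parameter vector produces the quasidata set whose binned proportions π_j enter Eq. (27).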

The algorithm above does still have some simplifying assumptions. For example, we assume that metastases of metastases will probably not be detected before the metastases themselves. We assume that the primary will be detected before a metastasis, and so on. Note, however, that the algorithm utilizes much less restrictive simplifying assumptions than those which led to the terms of the closed-form likelihood. Even more importantly, the Secondary Tumor Simulation algorithm can be discerned in a few minutes, whereas a likelihood argument is frequently the work of months.

Another advantage of the forward simulation approach is its ease of modification. Those who are familiar with ‘backward’ approaches based on the likelihood or the moment generating function are only too familiar with the experience of a slight modification causing the investigator to go back to the start and begin anew. This is again a consequence of the tangles required to be examined if a backward approach is used. However, a modification of the axioms generally causes slight inconvenience to the forward simulator.

For example, we might add the following

Hypothesis 5

A fraction γ of the patients ceases to be at systemic risk at the time of removal of the primary tumor if no secondary tumors exist at that time. A fraction 1−γ of the patients remain at systemic risk throughout their lives.

Clearly, adding Hypothesis 5 would cause considerable work if we insisted on using the classical aggregation approach of maximum likelihood. However, in the forward simulation method, we simply add the following lines to the secondary tumor simulation code (the corresponding change to the Python sketch above is shown after the list):

  • Generate u from U(0,1)
  • If u>γ, then proceed as in the secondary tumor simulation code
  • If u<γ, then proceed as in the secondary tumor simulation code except replace the step ‘Repeat until tS>10tD’ with the step ‘Repeat until tS(i)>tD’.
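In the Python sketch given above, the corresponding change is equally small; γ is the only new ingredient (a sketch):

def systemic_horizon(gamma: float, tD: float, rng) -> float:
    """Hypothesis 5: with probability gamma the patient leaves systemic risk once the
    primary is removed, so systemic occurrences are only followed up to tD; otherwise
    the 10*tD horizon of the original algorithm is kept."""
    return tD if rng.random() < gamma else 10.0 * tD

# inside secondary_detection_time(), the loop condition "tS <= 10.0 * tD"
# becomes "tS <= horizon" with horizon = systemic_horizon(gamma, tD, rng).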

In the discussion of metastasis and systemic occurrence of secondary tumors, we have used a model supported by data to try to gain some insight into a part of the complexities of the progression of cancer in a patient. Perhaps this sort of approach should be termed speculative data analysis.

In the current example, we were guided by a nonparametric intensity function estimate, which was surprisingly nonincreasing, to conjecture a model, which enabled us to test systemic origin against metastatic origin on something like a level playing field. The fit without the systemic term was so bad that anything like a comparison of goodness-of-fit statistics was unnecessary.

It is interesting to note that the implementation of SIMEST is generally faster on the computer than working through the estimation with the closed-form likelihood. In the four-parameter oncological example we have considered here, the running time of SIMEST was 10% of the likelihood approach. As a very practical matter, then, the simulation-based approach would appear to majorize that of the closed-form likelihood method in virtually all particulars. The running time for SIMEST can begin to become a problem as the dimensionality of the response variable increases past one. Up to this point, we have been working with the situation where the data consists of failure times. In the systemic versus metastatic oncogenesis example, we managed to estimate four parameters based on this kind of one-dimensional data. As a practical matter, for tumor data, the estimation of five or six parameters for failure time data is the most one can hope for. Indeed, in the oncogenesis example, we begin to observe the beginnings of singularity for four parameters, due to a near trade-off between the parameters a and b. Clearly, it is to our advantage to be able to increase the dimensionality of our observables. For example, with cancer data, it would be to our advantage to utilize not only the time from primary diagnosis and removal to secondary discovery and removal, but also the tumor volumes of the primary and the secondary. Such information enables one to postulate more individual growth rates for each patient. Thus, it is now appropriate to address the question of dealing with multivariate response data.

Gaussian template criterion

In many cases, it will be possible to employ a procedure using a criterion function. Such a procedure has proved quite successful in another context (see [5], pp. 275–280). First, we transform the data {X_i}_{i=1}^{n} by a linear transformation such that, for the transformed data set {U_i}_{i=1}^{n}, the mean vector becomes zero and the covariance matrix becomes I:

U = AX + b   (28)
Then, for the current best guess for Θ, we simulate a quasidata set of size N. Next, we apply the same transformation to the quasidata set {Y_j(Θ)}_{j=1}^{N}, yielding {Z_j(Θ)}_{j=1}^{N}. Assuming that both the actual data set and the simulated data set come from the same density, the likelihood ratio Λ(Θ) should decrease as Θ gets closer to the value of Θ, say Θ0, which gave rise to the actual data, where
Λ(Θ) = ∏_{i=1}^{n} exp[−(1/2)(u_{1i}^2 + ⋯ + u_{pi}^2)] / ∏_{i=1}^{N} exp[−(1/2)(z_{1i}^2 + ⋯ + z_{pi}^2)]   (29)
As soon as we have such a criterion function, we are able to develop an algorithm for estimating Θ0: the closer Θ is to Θ0, the smaller Λ(Θ) will tend to be.
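A sketch of this single-template criterion, assuming the actual data X (an n × p array) and the quasidata Y (an N × p array) are at hand; the logarithm of Λ in (29) is computed so that the products do not underflow:

import numpy as np

def whitening(X):
    """Return (A, b) such that A x + b has zero sample mean and identity covariance."""
    mu = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    w, V = np.linalg.eigh(cov)                    # inverse square root of the covariance
    A = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return A, -A @ mu

def log_lambda(X, Y):
    """log of Eq. (29): numerator over the transformed data U, denominator over the
    transformed quasidata Z, both using the transformation fitted to the data."""
    A, b = whitening(X)
    U = X @ A.T + b
    Z = Y @ A.T + b
    return -0.5 * (U ** 2).sum() + 0.5 * (Z ** 2).sum()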

The procedure above, which uses a single Gaussian template, will work well in many cases where the data have one distinguishable center and a falling off away from that center which is not too 'taily'. However, there will be cases where we cannot quite get away with such a simple approach. For example, a data set may have several distinguishable modes and/or exhibit very heavy tails. In such a case, we may be well advised to try a more local approach. Suppose that we pick one of the n data points at random – say x_1 – and find its m nearest neighbors amongst the data.

We then treat this m nearest-neighbor cloud as if it came from a Gaussian distribution centered at the sample mean of the cloud and with covariance matrix estimated from the cloud. We transform these m+1 points to zero mean and identity covariance matrix, via

U = A_1 X + b_1   (30)

Now, from our simulated set of N points, we find the N(m+1)/n simulated points nearest to the mean of the m+1 actual data points. This will give us an expression like

Λ_1(Θ) = ∏_{i=1}^{m+1} exp[−(1/2)(u_{1i}^2 + ⋯ + u_{pi}^2)] / ∏_{i=1}^{N(m+1)/n} exp[−(1/2)(z_{1i}^2 + ⋯ + z_{pi}^2)]   (31)
If we repeat this operation for each of the n data points, we will have a set of local likelihood ratios {Λ1,Λ2,…,Λn}. Then one natural measure of concordance of the simulated data with the actual data would be
Λ(Θ) = ∑_{i=1}^{n} log Λ_i(Θ)   (32)
We note that this procedure is not equivalent to one based on density estimation, since the nearest-neighbor ellipsoids are not disjoint. Nevertheless, we have a level playing field for each of the guesses for Θ and the resulting simulated data sets.

A simple counting criterion

Fast computing notwithstanding, with n in the 1000 range and N around 10 000, the template procedure can become prohibitively time consuming. Accordingly, we may opt for a subset counting procedure (a code sketch follows the steps):

For data size n, pick a smaller value, say nn.

Pick a random subset of the data points of size nn.

Pick a nearest neighbor outreach parameter m, typically 0.02n.

For each of the nn data points, Xj, find the Euclidean distance to the mth nearest neighbor, say dj,m.

For an assumed value of the vector parameter Θ, generate N simulated observations.

For each of the data points in the random subset of the data, find the number of simulated observations within dj,m, say Nj,m.

Then the criterion function becomes

χ^2(Θ) = ∑_{j=1}^{nn} [(m+1)/n − N_{j,m}/N]^2 / [(m+1)/n]

Experience indicates that whatever nn size subset of the data points is selected should be retained throughout the changes of Θ. Otherwise, practical instability may obscure the path to the minimum value of the criterion function.
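A sketch of the counting criterion, keeping the names nn and m of the text; the data X and the simulated observations S are assumed to be n × p and N × p arrays, and the random subset is fixed by seeding so that it is retained across evaluations of Θ, as recommended above:

import numpy as np

def counting_chi2(X, S, nn=50, m=None, seed=0):
    """Subset counting criterion: compare the proportion of simulated points S falling
    inside each d_{j,m} hypersphere with the nominal proportion (m+1)/n."""
    rng = np.random.default_rng(seed)            # fixed subset, retained across calls
    n, N = len(X), len(S)
    m = max(1, int(0.02 * n)) if m is None else m
    subset = rng.choice(n, size=min(nn, n), replace=False)
    expected = (m + 1) / n
    total = 0.0
    for j in subset:
        d = np.sort(np.linalg.norm(X - X[j], axis=1))[m]   # distance to m-th nearest neighbour
        N_jm = np.count_nonzero(np.linalg.norm(S - X[j], axis=1) <= d)
        total += (expected - N_jm / N) ** 2 / expected
    return total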

A SIMDAT-SIMEST stopping rule

We may use the resampling algorithm SIMDAT to compare the results from resampled data points with those from model-based simulations. SIMDAT is not a simple resampling so much as it is a stochastic interpolator. We can take the original data and use SIMDAT to generate a SIMDAT pseudodata set of N values.

Then, for a particular guess of Θ, we can compute a SIMEST pseudodata set of N values. For any region of the space of the vector observable, the number of SIMEST-generated points should be approximately equal to the number of SIMDAT-generated points. For example, let us suppose that we pick nn of the n original data points and find the radius dj,m of the hypersphere which includes m of the data points for, say, point Xj. Let Nj,SD be the number of SIMDAT-generated points falling inside the hypersphere and Nj,SE be the number of SIMEST-generated points falling inside the hypersphere. Consider the empirical goodness-of-fit statistic for the SIMDAT cloud about point Xj:

χ^2_{j,SD}(Θ) = [(m+1)/n − N_{j,SD}/N]^2 / [(m+1)/n]
For the SIMEST cloud, we have
χ^2_{j,SE}(Θ) = [(m+1)/n − N_{j,SE}/N]^2 / [(m+1)/n]
If the model is correct and if our estimate for Θ is correct, then χ^2_{j,SD}(Θ) should, on average, be distributed similarly to χ^2_{j,SE}(Θ). Accordingly, we can construct a sign test. To do so, let:
W_j = +1 if χ^2_{j,SD}(Θ) ⩾ χ^2_{j,SE}(Θ);  W_j = −1 if χ^2_{j,SD}(Θ) < χ^2_{j,SE}(Θ)
So, if we let:
Z = ∑_{j=1}^{nn} W_j / √(nn)
we might decide to terminate our search for estimating Θ when the absolute value of Z falls below 3 or 4.
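A sketch of this stopping rule; SIMDAT itself (the stochastic interpolator of [1]) is not reproduced here, so the SIMDAT pseudodata XD and the SIMEST pseudodata XE are taken as given arrays:

import numpy as np

def stopping_statistic(X, XD, XE, nn=50, m=None, seed=0):
    """Sign-test statistic Z comparing SIMDAT (XD) and SIMEST (XE) point counts
    inside the m-th nearest-neighbour hyperspheres of nn randomly chosen data points."""
    rng = np.random.default_rng(seed)            # fixed subset across calls
    n, ND, NE = len(X), len(XD), len(XE)
    m = max(1, int(0.02 * n)) if m is None else m
    subset = rng.choice(n, size=min(nn, n), replace=False)
    expected = (m + 1) / n
    W = []
    for j in subset:
        d = np.sort(np.linalg.norm(X - X[j], axis=1))[m]   # radius d_{j,m}
        chi_sd = (expected - np.count_nonzero(np.linalg.norm(XD - X[j], axis=1) <= d) / ND) ** 2 / expected
        chi_se = (expected - np.count_nonzero(np.linalg.norm(XE - X[j], axis=1) <= d) / NE) ** 2 / expected
        W.append(1.0 if chi_sd >= chi_se else -1.0)
    return sum(W) / np.sqrt(len(W))              # stop the search once |Z| falls below 3 or 4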


References

[1] J.R. Thompson Simulation: A Modeler's Approach, Wiley, New York, 2000

[2] S.D. Poisson Recherches sur la probabilité des jugements en matière criminelle et en matière civile, précedées des règles générales du calcul des probabilités, Paris, 1837

[3] A. Stuart; J.K. Ord, Kendall's Advanced Theory of Statistics, vol. 1, Oxford University Press, New York, 1987, p. 7

[4] R. Bartoszyński; B.W. Brown; J.R. Thompson Metastatic and systemic factors in neoplastic progression (L. LeCam; J. Neyman, eds.), Probability Models and Cancer, Academic Press, New York, 1982, pp. 253-264 (283–285)

[5] J.R. Thompson; R.A. Tapia Nonparametric Function Estimation, Modeling, and Simulation, SIAM, Philadelphia, 1990, pp. 214-226

