Is it normal that Stan becomes really slow with large data (vector[32000])?

I am using Stan to solve the following estimation problem. I have a vector S that represents the superposition of a finite but unknown number of templates (Ti):
S = (sum over i) bi * Ti + n

The bi are unknown non-negative parameters (in [0…Max]). bi is initialized to a vector of random normal draws whose mean is close to a typical expected answer and whose standard deviation is on the order of that mean. n is assumed to be Gaussian noise.

The typical size of the vector S is 32K, and the typical number of templates (the range of i) is 100; each template is also of length 32K.
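For reference, written directly as a linear model, the setup above is roughly the following in Stan (N, K, T, S, b, and sigma are placeholder names here, not the ones used in my actual model below):

data {
  int<lower=1> N;              // length of the signal (e.g. 32000)
  int<lower=1> K;              // number of templates (e.g. 100)
  matrix[N, K] T;              // templates, one per column
  vector[N] S;                 // observed signal
}
parameters {
  vector<lower=0>[K] b;        // non-negative template weights
  real<lower=0> sigma;         // Gaussian noise scale
}
model {
  b ~ normal(0, 1);            // placeholder prior; in practice centered near the expected answer
  sigma ~ normal(0, 1);        // placeholder prior on the noise scale
  S ~ normal(T * b, sigma);    // S = sum_i bi * Ti + Gaussian noise
}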

If I reduce the size of the vector to 1024 on simulated data, Stan generates the estimate quickly (in less than 10 minutes). With my full-size vector, Stan does not finish even after days.

It is not a resource problem because my machine with 32 cores and 256GB of memory is never close to fully utilized.

Here is an abbreviated version of how I set this up.

functions {
  real dotproduct_final_lpdf(real answer, int LENGTH, int final_size,
                             vector scale, matrix templates, vector signal) {
    vector[2 * final_size] scales;
    vector[2 * LENGTH] scaled_temps;
    vector[2 * LENGTH] M_ST;
    vector[LENGTH] final_m_ct_1;
    vector[LENGTH] final_m_ct_2;
    vector[LENGTH] scaled_temps_1;
    vector[LENGTH] scaled_temps_2;
    vector[final_size] scale2;
    real Dot_1part = 0;
    real Dot_2part = 0;
    real Dot_product = 0;
    real result = 0;
    real M_cT_norm = 0;

    scale2 = rep_vector(0.0, final_size);
    scales = append_row(scale, scale2);
    scaled_temps = templates * scales;
    M_ST = signal - scaled_temps;

    final_m_ct_1 = head(M_ST, LENGTH);
    final_m_ct_2 = tail(M_ST, LENGTH);

    scaled_temps_1 = head(scaled_temps, LENGTH);
    scaled_temps_2 = tail(scaled_temps, LENGTH);

    M_cT_norm = sqrt(dot_product(final_m_ct_2, final_m_ct_2)
                     + dot_product(final_m_ct_1, final_m_ct_1));
    final_m_ct_2 = (1 / M_cT_norm) * final_m_ct_2;
    final_m_ct_1 = (1 / M_cT_norm) * final_m_ct_1;
    Dot_1part = dot_product(scaled_temps_2, final_m_ct_2)
                + dot_product(scaled_temps_1, final_m_ct_1);
    Dot_2part = dot_product(scaled_temps_1, final_m_ct_2)
                - dot_product(final_m_ct_1, scaled_temps_2);
    Dot_product = sqrt(Dot_1part * Dot_1part + Dot_2part * Dot_2part);
    result = 1 - fabs(Dot_product);
    return result;
  }
}


data {
  int is_it_final;
  real scale_ratio2;
  int LENGTH;
  int final_size;
  matrix[2 * LENGTH, 2 * final_size] templates;
  vector[2 * LENGTH] signal;
  vector[final_size] scales_mat_est;
  vector[final_size] scale2_mat_est;
}

parameters {
  vector<lower = 0.000000001>[final_size] scale;
  real scale_final;
}

model {
  scale ~ normal(scales_mat_est, scales_mat_est * 2.0);
  scale_final ~ dotproduct_final(LENGTH, final_size, scale, templates, signal);
}

Have you looked at model diagnostics? That's a more likely source of the problem than data size.

Thank you for your reply.
I tried

sm3 = pystan.StanModel(model_code=FINAL_MODEL2, verbose=True, diagnostic_file="diagnostic.txt")

and I got this error:

TypeError: __init__() got an unexpected keyword argument 'diagnostic_file'

I tried

fit = sm3.sampling(data=Final_Data2, iter=num_iterations_final, chains=num_chains_final, diagnostic_file="diagnostic.txt")

and I got this error:

raise NotImplementedError("diagnostic_file not supported yet")
NotImplementedError: diagnostic_file not supported yet

I meant effective sample size and R-hat, for example.

The diagnostics can help reveal problematic posterior geometry, which can be a much bigger issue than data size: for instance, running up against max treedepth or getting divergences.

You shouldn't need <lower = epsilon> bounds in Stan. Are you getting a lot of probability mass piling up near zero? If so, and you want to keep the estimate bounded away from zero, you can use zero-avoiding priors.
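For example, something like this keeps the mass away from zero without a hard bound (a sketch only; the lognormal hyperparameters are placeholders to tune for your problem):

parameters {
  vector<lower = 0>[final_size] scale;   // plain non-negativity, no epsilon offset
}
model {
  scale ~ lognormal(0, 1);               // zero-avoiding prior; density goes to 0 as scale -> 0
  ...
}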

You may find using compound declare-define makes the programs a lot clearer. For example,

real  dotproduct_final_lpdf(real answer, int LENGTH,
                            int final_size, vector scale,
                            matrix templates, vector signal) {
  vector[final_size] scale2 = rep_vector(0, final_size);
  vector[2 * final_size] scales = append_row(scale, scale2);
  vector[2 * LENGTH] scaled_temps = templates * scales;
  vector[2 * LENGTH] M_ST = signal - scaled_temps;
  ...

Also, you want to avoid multiplying by zero wherever possible, as it just wastes derivative time. In this case, you're creating a vector scales that is half zeros, then multiplying a matrix by it. You really only need the columns of the matrix and the entries of the vector corresponding to the non-zero vector entries; that yields the same product without all the multiplies by zero. For example, if m is an N x M matrix and v is an M-vector of the form v = (v_1, ..., v_K, 0, ..., 0), then

m * v == m[ , 1:K] * v[1:K]

but the right-hand side involves far less arithmetic.
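Applied to the program above, that means scale2 and scales can go away entirely and the product can be written along these lines (a sketch using the names from your model):

// same result as append_row(scale, scale2) followed by templates * scales,
// but without the multiplies by zero
vector[2 * LENGTH] scaled_temps = templates[ , 1:final_size] * scale;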