Is it normal that Stan becomes really slow with large data (vector[32000])?

I am using Stan to solve the following estimation problem. I have a vector S that represents the superposition of a finite but unknown number of templates (Ti):
S = (sum over i) bi * Ti + n

The bi are unknown non-negative parameters (in [0…Max]). bi is initialized to a vector of random normal draws whose mean is close to a typical expected answer and whose standard deviation is on the order of that mean. n is assumed to be Gaussian noise.

The typical size of the vector S is 32K, and the typical number of templates (the range of i) is 100; each template is also of length 32K.
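For reference, written directly as a linear model, the setup above is roughly the following in Stan (N, K, T, S, b, and sigma are placeholder names here, not the ones used in my actual model below):

data {
  int<lower=1> N;              // length of the signal (e.g. 32000)
  int<lower=1> K;              // number of templates (e.g. 100)
  matrix[N, K] T;              // templates, one per column
  vector[N] S;                 // observed signal
}
parameters {
  vector<lower=0>[K] b;        // non-negative template weights
  real<lower=0> sigma;         // Gaussian noise scale
}
model {
  b ~ normal(0, 1);            // placeholder prior; in practice centered near the expected answer
  sigma ~ normal(0, 1);        // placeholder prior on the noise scale
  S ~ normal(T * b, sigma);    // S = sum_i bi * Ti + Gaussian noise
}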

If I reduce the size of the vector to 1024 on simulated data, Stan generates the estimate quickly (in less than 10 minutes). With my full-size vector, Stan does not finish even after days.

It is not a resource problem because my machine with 32 cores and 256GB of memory is never close to fully utilized.

Here is an abbreviated version of how I set this up.

functions {
  real dotproduct_final_lpdf(real answer, int LENGTH, int final_size,
                             vector scale, matrix templates, vector signal) {
    vector[2 * final_size] scales;
    vector[2 * LENGTH] scaled_temps;
    vector[2 * LENGTH] M_ST;
    vector[LENGTH] final_m_ct_1;
    vector[LENGTH] final_m_ct_2;
    vector[LENGTH] scaled_temps_1;
    vector[LENGTH] scaled_temps_2;
    vector[final_size] scale2;
    real Dot_1part = 0;
    real Dot_2part = 0;
    real Dot_product = 0;
    real result = 0;
    real M_cT_norm = 0;

    scale2 = rep_vector(0.0, final_size);
    scales = append_row(scale, scale2);
    scaled_temps = templates * scales;
    M_ST = signal - scaled_temps;

    final_m_ct_1 = head(M_ST, LENGTH);
    final_m_ct_2 = tail(M_ST, LENGTH);

    scaled_temps_1 = head(scaled_temps, LENGTH);
    scaled_temps_2 = tail(scaled_temps, LENGTH);

    M_cT_norm = sqrt(dot_product(final_m_ct_2, final_m_ct_2)
                     + dot_product(final_m_ct_1, final_m_ct_1));
    final_m_ct_2 = (1 / M_cT_norm) * final_m_ct_2;
    final_m_ct_1 = (1 / M_cT_norm) * final_m_ct_1;
    Dot_1part = dot_product(scaled_temps_2, final_m_ct_2)
                + dot_product(scaled_temps_1, final_m_ct_1);
    Dot_2part = dot_product(scaled_temps_1, final_m_ct_2)
                - dot_product(final_m_ct_1, scaled_temps_2);
    Dot_product = sqrt(Dot_1part * Dot_1part + Dot_2part * Dot_2part);
    result = 1 - fabs(Dot_product);
    return result;
  }
}


data {
  int is_it_final;
  real scale_ratio2;
  int LENGTH;
  int final_size;
  matrix[2 * LENGTH, 2 * final_size] templates;
  vector[2 * LENGTH] signal;
  vector[final_size] scales_mat_est;
  vector[final_size] scale2_mat_est;
}

parameters {
  vector<lower = 0.000000001>[final_size] scale;
  real scale_final;
}

model {
  scale ~ normal(scales_mat_est, scales_mat_est * 2.0);
  scale_final ~ dotproduct_final(LENGTH, final_size, scale, templates, signal);
}

Have you looked at model diagnostics? That's a more likely source of the problem than data size.

Thank you for your reply.
I tried

sm3 = pystan.StanModel(model_code=FINAL_MODEL2, verbose=True, diagnostic_file="diagnostic.txt")

and I got this error:

TypeError: __init__() got an unexpected keyword argument 'diagnostic_file'

I tried

fit = sm3.sampling(data=Final_Data2, iter=num_iterations_final, chains=num_chains_final, diagnostic_file="diagnostic.txt")

and I got this error:

raise NotImplementedError("diagnostic_file not supported yet")
NotImplementedError: diagnostic_file not supported yet

I meant effective sample size and R-hat, for example.

The diagnostics can help reveal problematic posterior geometry, which can be a much bigger issue than data size: for instance, running up against max treedepth or getting divergences.

You shouldn't need <lower = epsilon> bounds in Stan. Are you getting a lot of probability mass piling up near zero? If so, and you want to keep the estimate bounded away from zero, you can use zero-avoiding priors.
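For example, something like this keeps the mass away from zero without a hard bound (a sketch only; the lognormal hyperparameters are placeholders to tune for your problem):

parameters {
  vector<lower = 0>[final_size] scale;   // plain non-negativity, no epsilon offset
}
model {
  scale ~ lognormal(0, 1);               // zero-avoiding prior; density goes to 0 as scale -> 0
  ...
}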

You may find using compound declare-define makes the programs a lot clearer. For example,

real  dotproduct_final_lpdf(real answer, int LENGTH,
                            int final_size, vector scale,
                            matrix templates, vector signal) {
  vector[final_size] scale2 = rep_vector(0, final_size);
  vector[2 * final_size] scales = append_row(scale, scale2);
  vector[2 * LENGTH] scaled_temps = templates * scales;
  vector[2 * LENGTH] M_ST = signal - scaled_temps;
  ...

Also, you want to avoid multiplying by zero wherever possible, as it just wastes derivative time. In this case, you're creating a vector scales that is half zeros, then multiplying a matrix by it. You really only need the columns of the matrix and the entries of the vector corresponding to the non-zero vector entries; that yields the same product without all the multiplies by zero. For example, if m is an N x M matrix and v is an M-vector of the form v = (v_1, ..., v_K, 0, ..., 0), then

m * v == m[ , 1:K] * v[1:K]

but the right-hand side involves far less arithmetic.
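Applied to the program above, that means scale2 and scales can go away entirely and the product can be written along these lines (a sketch using the names from your model):

// same result as append_row(scale, scale2) followed by templates * scales,
// but without the multiplies by zero
vector[2 * LENGTH] scaled_temps = templates[ , 1:final_size] * scale;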