Error at the end of the sampling phase

Hi, I have a model that builds and fits without problems on my machine.
The problem is that when I run it in Docker with more than 200 samples, I get a strange error:

Sampling: 90% (2700/3000)
Sampling: 93% (2800/3000)
Sampling: 97% (2900/3000)
INFO:httpstan:Operation operations/rgc3xdv4 finished.
CRITICAL:httpstan:Operation operations/odhmza7x cancelled before finishing.
ERROR:asyncio:Exception in callback handle_create_fit.._services_call_done({‘done’: True, ‘metadata’: {‘fit’: {‘name’: ‘models/3bd4r…fits/odhmza7x’}, ‘progress’: ‘Iteration: 1…] (Sampling)’}, ‘name’: ‘operations/odhmza7x’})(>) at /usr/local/lib/python3.9/site-packages/httpstan/views.py:367
handle: <Handle handle_create_fit.._services_call_done({‘done’: True, ‘metadata’: {‘fit’: {‘name’: ‘models/3bd4r…fits/odhmza7x’}, ‘progress’: ‘Iteration: 1…] (Sampling)’}, ‘name’: ‘operations/odhmza7x’})(>) at /usr/local/lib/python3.9/site-packages/httpstan/views.py:367>
Traceback (most recent call last):
File “/usr/local/lib/python3.9/asyncio/runners.py”, line 44, in run
return loop.run_until_complete(main)
File “/usr/local/lib/python3.9/asyncio/base_events.py”, line 642, in run_until_complete
return future.result()
File “/usr/local/lib/python3.9/site-packages/stan/model.py”, line 198, in go
resp = await client.get(f"/{operation['name']}")
File “/usr/local/lib/python3.9/site-packages/stan/common.py”, line 46, in get
async with self.session.get(f"{self.base_url}{path}") as resp:
File “/usr/local/lib/python3.9/site-packages/aiohttp/client.py”, line 1117, in aenter
self._resp = await self._coro
File “/usr/local/lib/python3.9/site-packages/aiohttp/client.py”, line 544, in _request
await resp.start(conn)
File “/usr/local/lib/python3.9/site-packages/aiohttp/client_reqrep.py”, line 905, in start
self._continue = None
File “/usr/local/lib/python3.9/site-packages/aiohttp/helpers.py”, line 656, in exit
raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/lib/python3.9/site-packages/httpstan/services_stub.py”, line 145, in call
await asyncio.sleep(0.001)
File “/usr/local/lib/python3.9/asyncio/tasks.py”, line 655, in sleep
return await future
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/lib/python3.9/asyncio/events.py”, line 80, in _run
self._context.run(self._callback, *self._args)
File “/usr/local/lib/python3.9/site-packages/httpstan/views.py”, line 380, in _services_call_done
exc = future.exception()
asyncio.exceptions.CancelledError
Traceback (most recent call last):
File “//main.py”, line 174, in
main()
File “//main.py”, line 109, in main
latent_model.fit_model(
File “/src/models/latent.py”, line 100, in fit_model
self.fit = self.posterior.sample(num_chains=num_chains, num_samples=samples)
File “/usr/local/lib/python3.9/site-packages/stan/model.py”, line 84, in sample
return self.hmc_nuts_diag_e_adapt(num_chains=num_chains, **kwargs)
File “/usr/local/lib/python3.9/site-packages/stan/model.py”, line 103, in hmc_nuts_diag_e_adapt
return self._create_fit(function=function, num_chains=num_chains, **kwargs)
File “/usr/local/lib/python3.9/site-packages/stan/model.py”, line 306, in _create_fit
return asyncio.run(go())
File “/usr/local/lib/python3.9/asyncio/runners.py”, line 44, in run
return loop.run_until_complete(main)
File “/usr/local/lib/python3.9/asyncio/base_events.py”, line 642, in run_until_complete
return future.result()
File “/usr/local/lib/python3.9/site-packages/stan/model.py”, line 198, in go
resp = await client.get(f"/{operation['name']}")
File “/usr/local/lib/python3.9/site-packages/stan/common.py”, line 46, in get
async with self.session.get(f"{self.base_url}{path}") as resp:
File “/usr/local/lib/python3.9/site-packages/aiohttp/client.py”, line 1117, in aenter
self._resp = await self._coro
File “/usr/local/lib/python3.9/site-packages/aiohttp/client.py”, line 544, in _request
await resp.start(conn)
File “/usr/local/lib/python3.9/site-packages/aiohttp/client_reqrep.py”, line 905, in start
self._continue = None
File “/usr/local/lib/python3.9/site-packages/aiohttp/helpers.py”, line 656, in exit
raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError

When the sample size is smaller (100) it runs with no error.
What can be the cause?

Hi,

How do you call PyStan? With the 8-schools example?

Hi!
I wrote a custom class, but in the end it does the same thing:

def _read_model_build(filename, data):
        """Read stan model from file and build

        Args:
            filename (str): file path
            data (dict): Stan data to fit

        Returns:
            Model: stan model
        """
        with open(filename) as f:
            model_code = f.read()
            sm = stan.build(model_code, data=data)
            return sm

def fit_model(self, data, num_chains=2, samples=500):
        """Fit model

        Args:
            data (dict): stan data
            num_chains (int, optional): Number of Markov chains. Defaults to 2.
            samples (int, optional): Number of samples to draw for each chain. Defaults to 500.
        """
        self.posterior = self._read_model_build(self.file, data)
        self.sample_count = num_chains * samples
        self.fit = self.posterior.sample(num_chains=num_chains, num_samples=samples)

I wonder whether SO_REUSEPORT is not enabled in your Docker image. I don't know how to enable it, but there are probably some network settings that can be edited. This is apparently something that can cause failures with Python multiprocessing in Docker.
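
One way to check this from inside the container is a small standard-library probe like the one below. It is only a sketch: the function name `reuseport_supported` is made up for illustration, and a `True` result shows only that the option can be set on a socket, not that it rules out the timeout above.

```python
import socket


def reuseport_supported() -> bool:
    """Return True if a TCP socket accepts the SO_REUSEPORT option here."""
    # The constant is missing on platforms that do not expose the option.
    if not hasattr(socket, "SO_REUSEPORT"):
        return False
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
            # Port 0 asks the OS for any free port on the loopback interface.
            s.bind(("127.0.0.1", 0))
        return True
    except OSError:
        # The constant exists but the kernel or sandbox rejected it.
        return False


if __name__ == "__main__":
    print("SO_REUSEPORT supported:", reuseport_supported())
```

Running it once on the host and once inside the Docker image would show whether the two environments differ in this respect.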

Do these errors happen if you use only 1 chain?
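
To narrow this down, you could run a few chain/sample combinations and record which ones time out. The helper below is a hypothetical sketch, not part of PyStan: `find_failing_configs` is an invented name, and it only assumes a callable that accepts `num_chains` and `num_samples` keyword arguments, as `posterior.sample` does in the code above.

```python
import asyncio


def find_failing_configs(sample, configs):
    """Run sample() for each (num_chains, num_samples) pair and record the outcome.

    `sample` is any callable taking num_chains and num_samples keyword
    arguments; `configs` is an iterable of (num_chains, num_samples) pairs.
    """
    results = []
    for num_chains, num_samples in configs:
        try:
            sample(num_chains=num_chains, num_samples=num_samples)
            results.append((num_chains, num_samples, "ok"))
        except asyncio.TimeoutError:
            # Same exception type as at the bottom of the traceback above.
            results.append((num_chains, num_samples, "timeout"))
    return results
```

For example, `find_failing_configs(posterior.sample, [(1, 500), (2, 100), (2, 500)])` would show whether the timeout depends on the number of chains, the number of samples, or both.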

Off topic: I recommend closing the with statement before building the model, so the file is not held open while the model is being built:

        with open(filename) as f:
            model_code = f.read()
        sm = stan.build(model_code, data=data)
        return sm