CADET Optimization Stalling Using Mealpy

Hey,

I am using mealpy for my PCC optimization and I am having problems with the optimization stalling. Mealpy handles the parallelization, and I think that might be where things are going wrong. I talked to the high-performance computing (HPC) cluster support people, and here is what they found:

I’ve been watching on r04n00 and your Python script starts out running as expected:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

65370 dhaught 20 0 3194440 414208 84220 S 113.7 0.2 22:37.39 python

It was using barely over 100% of a CPU core, which upon investigation was a single computational thread plus additional threads (garbage collection, etc.) that were spinning, waiting to be woken up by the interpreter. Eventually, 32 cadet-cli processes appeared along with 32 multiprocessing threads:

(gdb) info threads

Id Target Id Frame

* 1 Thread 0x2b38a58b98c0 (LWP 65370) “python” 0x00002b38a5ab7adb in do_futex_wait.constprop ()

from /lib64/libpthread.so.0

2 Thread 0x2b38c1600700 (LWP 65373) “jemalloc_bg_thd” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()

from /lib64/libpthread.so.0

3 Thread 0x2b38c293e700 (LWP 65374) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()

from /lib64/libpthread.so.0

4 Thread 0x2b38c2b3f700 (LWP 65375) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()

from /lib64/libpthread.so.0

5 Thread 0x2b38c2d40700 (LWP 65376) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()

from /lib64/libpthread.so.0

6 Thread 0x2b38c2f41700 (LWP 65377) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()

from /lib64/libpthread.so.0

7 Thread 0x2b38c3142700 (LWP 65378) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()

from /lib64/libpthread.so.0

8 Thread 0x2b38c3343700 (LWP 65379) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()

from /lib64/libpthread.so.0

9 Thread 0x2b38c3544700 (LWP 65380) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()

from /lib64/libpthread.so.0

10 Thread 0x2b38c3745700 (LWP 65381) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()

from /lib64/libpthread.so.0

11 Thread 0x2b38dbd87700 (LWP 65651) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

12 Thread 0x2b38dbf88700 (LWP 65652) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

13 Thread 0x2b38e4200700 (LWP 65653) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

14 Thread 0x2b38e4401700 (LWP 65654) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

15 Thread 0x2b38e4602700 (LWP 65655) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

16 Thread 0x2b38e4803700 (LWP 65656) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

17 Thread 0x2b38e4a04700 (LWP 65657) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

18 Thread 0x2b38e4c05700 (LWP 65658) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

19 Thread 0x2b38e4e06700 (LWP 65659) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

20 Thread 0x2b38e5007700 (LWP 65660) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

21 Thread 0x2b38e5208700 (LWP 65661) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

22 Thread 0x2b38e5409700 (LWP 65677) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

23 Thread 0x2b38e560a700 (LWP 65678) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

24 Thread 0x2b38e580b700 (LWP 65679) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

25 Thread 0x2b38e5a0c700 (LWP 65681) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

26 Thread 0x2b38e5c0d700 (LWP 65682) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

27 Thread 0x2b38e5e0e700 (LWP 65683) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

28 Thread 0x2b38e600f700 (LWP 65684) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

29 Thread 0x2b38e6210700 (LWP 65685) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

30 Thread 0x2b38e6411700 (LWP 65686) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

31 Thread 0x2b38e6612700 (LWP 65687) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

32 Thread 0x2b38e6813700 (LWP 65688) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

33 Thread 0x2b38e6a14700 (LWP 65689) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

34 Thread 0x2b38e6c15700 (LWP 65690) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

35 Thread 0x2b38e6e16700 (LWP 65692) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

36 Thread 0x2b38e7017700 (LWP 65693) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

37 Thread 0x2b38e7218700 (LWP 65694) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

38 Thread 0x2b38e7419700 (LWP 65695) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

39 Thread 0x2b38e761a700 (LWP 65697) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

40 Thread 0x2b38e781b700 (LWP 65698) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

41 Thread 0x2b38e7a1c700 (LWP 65699) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

42 Thread 0x2b38e7c21700 (LWP 65700) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6

UID PID PPID C STIME TTY TIME CMD

dhaught 65311 65301 0 13:21 ? 00:00:00 /bin/bash -l /var/spool/slurm/job29471921/slurm_script

dhaught 65370 65311 98 13:21 ? 00:27:33 python Fit_LSHADE_PCC_V2.py

dhaught 67802 65370 99 13:49 ? 00:00:18 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 67836 65370 99 13:49 ? 00:00:17 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 67868 65370 99 13:49 ? 00:00:17 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 67900 65370 99 13:49 ? 00:00:16 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 67932 65370 89 13:49 ? 00:00:13 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 67933 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 67996 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 67998 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 67999 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68031 65370 78 13:49 ? 00:00:11 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68051 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68064 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68065 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68066 65370 90 13:49 ? 00:00:13 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68084 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68192 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68193 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68243 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68258 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68290 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68343 65370 99 13:49 ? 00:00:14 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68354 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68355 65370 99 13:49 ? 00:00:14 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68356 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68388 65370 99 13:49 ? 00:00:14 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68389 65370 99 13:49 ? 00:00:14 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68421 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68500 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68501 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68502 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68549 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 68574 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 65371 65311 0 13:21 ? 00:00:00 tee output.log

Additional cadet-cli processes (totaling the number of data points) started and ended until all cadet-cli processes had completed and only the Python process remained, right up until a single cadet-cli was started:

UID PID PPID C STIME TTY TIME CMD

dhaught 65311 65301 0 13:21 ? 00:00:00 /bin/bash -l /var/spool/slurm/job29471921/slurm_script

dhaught 65370 65311 92 13:21 ? 00:34:40 python Fit_LSHADE_PCC_V2.py

dhaught 65371 65311 0 13:21 ? 00:00:00 tee output.log

UID PID PPID C STIME TTY TIME CMD

dhaught 65311 65301 0 13:21 ? 00:00:00 /bin/bash -l /var/spool/slurm/job29471921/slurm_script

dhaught 65370 65311 92 13:21 ? 00:34:42 python Fit_LSHADE_PCC_V2.py

dhaught 71072 65370 0 13:59 ? 00:00:04 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor

dhaught 65371 65311 0 13:21 ? 00:00:00 tee output.log

On this pass, there were only ever 7 cadet-cli programs running and at most 7 HDF5 files in ./tmp at any time. Eventually this wound down to just a single cadet-cli program, and that one has been running since ca. 15:30 yesterday. I attached to it with gdb and set a few breakpoints to isolate what it’s doing:

(gdb) bt

#0 0x00002af4f61fbd25 in IDAStep () from /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/…/lib/libcadet.so.0

#1 0x00002af4f6201e1d in IDASolve () from /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/…/lib/libcadet.so.0

#2 0x00002af4f5d1647b in cadet::Simulator::integrate() ()

from /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/…/lib/libcadet.so.0

#3 0x00005586e34a61af in main ()

The program keeps entering and exiting the [IDASolve(), IDAStep(), …] sequence of functions but has been inside cadet::Simulator::integrate() the whole time. I would have to surmise that there are some numerical stability or convergence issues that are preventing the function from reaching a conclusion. If you can enable greater verbosity in the execution of the cadet-cli program, that might help.

I was wondering if y’all know what is going on?
I am using this CADET-Process: GitHub - angela-moser/CADET-Process: A Framework for Modelling and Optimizing Advanced Chromatographic Processes
I am also using CADET-Core 5.0.
Here are the files from the HPC run:
Caviness_PCC_Opti_V10.zip (1008.3 KB)

Hey Daniel,

this looks to me like the optimizer chooses a parameter combination that causes numerical stability issues in CADET-Core (aka cadet-cli).

To further investigate you can, as a first step, identify which parameter combination caused CADET-Core to stall. Can you retrieve the parameter values used in the stuck process from the mealpy log?

If that’s not currently possible, here are some setups that come to mind:

  1. Add a timeout to the CADET-Process simulator with simulator.timeout = 10 * 60 for a 10-minute timeout (or whatever time is reasonable to allow other simulations to finish while still terminating the stuck simulation in a timely fashion). Then catch the CADETProcessError('Simulator timed out') that will be raised and add the Sol_Params that raised the error to the log.
  2. Add additional logging to the objective_multi function to log the Sol_Params plus a “started” flag at the beginning, and add another log entry with the Sol_Params plus a “completed” flag once a simulation finishes. Then look for the set of parameters that starts but never completes. (A sketch combining both ideas follows this list.)
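Here is a minimal sketch combining both ideas, assuming the Sol_Params/objective_multi names from your script; run_pcc_simulation, the log file name, and the penalty value are placeholders, and the CADETProcessError import path may differ between CADET-Process versions:

import logging

from CADETProcess import CADETProcessError  # import path may vary by version

logging.basicConfig(filename="pcc_opti.log", level=logging.INFO)


def run_pcc_simulation(Sol_Params):
    # Placeholder for the real simulation call in Fit_LSHADE_PCC_V2.py
    # (process setup, simulator.simulate(...), objective calculation).
    # Idea 1 assumes simulator.timeout = 10 * 60 was set on the simulator.
    raise NotImplementedError


def objective_multi(Sol_Params):
    # Idea 2: log the parameter set before the simulation starts ...
    logging.info("started: %s", Sol_Params)
    try:
        result = run_pcc_simulation(Sol_Params)
    except CADETProcessError as err:
        # A timed-out run raises CADETProcessError('Simulator timed out');
        # record the offending parameters and return a placeholder penalty
        # so the optimizer can keep going.
        logging.error("timed out / failed: %s (%s)", Sol_Params, err)
        return 1e10
    # ... and again once it finishes, so a stuck run shows up as a
    # "started" entry without a matching "completed" entry.
    logging.info("completed: %s", Sol_Params)
    return result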

Once you have the parameters, you can use them to run a simulation with the same parameters locally. Then there are two possible paths:

  1. The simulation does get stuck when run locally. Then we, you, or both of us together can investigate the behavior in a debug build of CADET-Core. Regardless of who investigates, the parameter values would be very interesting for us to add to our test suite to ensure future releases of CADET-Core do not suffer from this instability.
  2. The simulation doesn’t get stuck locally. Then we’ll need to look into why it gets stuck when run in the mealpy-parallelized backend.

Hey, I tested the parameter set that caused one of my optimizations to stall on the high-performance computing cluster (HPC), and it worked fine locally. I asked one of the HPC support people if he knows what is going on.

Here is what he said in a previous email about how mealpy parallelizes.

When you ask mealpy to use parallelism on a solver, it uses the concurrent.futures thread- or process-pool helper – you’re requesting the ThreadPoolExecutor. With the ThreadPoolExecutor, the spawned threads are subject to the Global Interpreter Lock (GIL) and that can cause deadlocks. With the ProcessPoolExecutor (mode=“process”) the multiprocessing module is used, and you can run into problems if your code is not structured properly.
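For context, the two pool types he mentions map onto plain concurrent.futures roughly like this (a generic sketch, not mealpy’s internal code; the cadet-cli call and file names are just placeholders):

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import subprocess


def run_cadet(h5_file):
    # Each worker launches an external cadet-cli process on one input file;
    # cadet-cli must be on PATH and the files must exist for this to run.
    return subprocess.run(["cadet-cli", h5_file]).returncode


if __name__ == "__main__":
    files = [f"sim_{i}.h5" for i in range(32)]

    # Threads share one interpreter and are subject to the GIL while running
    # Python code; they only release it while blocked on the subprocess call.
    with ThreadPoolExecutor(max_workers=32) as pool:
        thread_codes = list(pool.map(run_cadet, files))

    # Worker processes each get their own interpreter (no GIL contention),
    # but arguments and results must be picklable and the module is
    # re-imported in each worker, hence the __main__ guard.
    with ProcessPoolExecutor(max_workers=32) as pool:
        process_codes = list(pool.map(run_cadet, files))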

Best,
Daniel

I looked through the code you shared with me and found only one part that could deadlock due to the GIL, and even that one is very, very unlikely: the generation of tempfile names.

Currently this uses the tempfile library:

from CADETProcess.simulator.cadetAdapter line 176:

def get_tempfile_name(self):
    f = next(tempfile._get_candidate_names())
    return self.temp_dir / f'{f}.h5'

These generate 8-character file names drawn from 36 characters (a-z plus 0-9). Two draws could produce the same name with a probability of 3.5447042e-13. If they did, two cadet-cli processes would try to access the same file and one should crash with a permission error. But that wouldn’t lead to it getting stuck in integrate.
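Just to illustrate how one could rule this scenario out entirely (this is not what CADET-Process does, only a hypothetical drop-in for testing), the name generation could use uuid4 instead:

import uuid

def get_tempfile_name(self):
    # uuid4 hex names are collision-free for all practical purposes, so
    # swapping this in for tempfile._get_candidate_names() would take
    # filename collisions off the table as a cause of the stall.
    return self.temp_dir / f'{uuid.uuid4().hex}.h5'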

I don’t currently know of a way in which the GIL could lead to cadet-cli getting stuck within the integrate loop. (That doesn’t mean it can’t happen, only that I don’t know how to search for it without an MRE, a minimal reproducible example.)

I also tried running your code and it completed multiple batches without getting stuck, so I don’t have any further leads. Again, I need an MRE.

To get to an MRE, we’d need to eliminate randomness in the optimizer by setting a fixed seed in model.solve, and I need more information on the HPC platform, such as the OS, CPU architecture, and conda environment package list.
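For reference, here is a minimal sketch of what I mean by a fixed seed, assuming a recent mealpy version (the exact solve() signature and the L_SHADE import path may differ between versions); the bounds and objective below are placeholders rather than the real PCC setup from Fit_LSHADE_PCC_V2.py:

import numpy as np
from mealpy import FloatVar
from mealpy.evolutionary_based.SHADE import L_SHADE


def objective_multi(solution):
    # Placeholder objective; the real one runs cadet-cli via CADET-Process.
    return float(np.sum(np.asarray(solution) ** 2))


problem = {
    "obj_func": objective_multi,
    "bounds": FloatVar(lb=[0.0] * 4, ub=[1.0] * 4),
    "minmax": "min",
}

model = L_SHADE(epoch=5, pop_size=32)
# seed makes the random draws reproducible; mode/n_workers select the
# parallel backend ("thread" or "process") and the number of workers.
best = model.solve(problem, mode="process", n_workers=32, seed=12345)
print(best)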


It does happen very rarely for me. I just finished a run today, and it only happened once in the 5 epochs. I also did another run with almost the same code, except I fixed a mistake where I did not import a class in Fit_LSHADE_PCC_V2 and also specified a minimum cycle count in the PCC functions.

I did notice, in another script where I was testing the number of cycles it took to reach steady state, that n_cycles_max only worked if n_cycles_min was specified.
Here is the output when the minimum cycle count is not specified:


Output columns: Yield, Productivity, Resin Capacity Utilization, max cycles sent to CADET, actual cycle number from sim_results

Note: V11 is the code that had the one case of stalling; V12 had no cases. To look at the console output you can just look at the .log file, or, if you want to see the outputs from mealpy, I included the .out file with the bunch of numbers in its name.
Caviness_PCC_Opti_V11_Output_N2.zip (762.5 KB)
Caviness_PCC_Opti_V12_Output_N1.zip (826.5 KB)

I can confirm that the parameter combination that stalled a simulation in V11:

PCC_Params_static = [15, 1, 30, 1]
PCC_Params_var = [50., 0.25, 3.67044556, 1.97656121]
PCC_Obj_Processing(experiment, PCC_Params_var, PCC_Params_static, Yield_CU, ProA)

works without problems on CADET v5.0.0 - v5.0.3 on Windows and Linux. That is a very puzzling bug. If you can continue your work with the current timeout-based workaround, I’ll put research on this on hold until we get more information.

Regarding your other question: The default value for n_cycles_min is 5, and that is also the “batch” number of cycles that are started at a time. So it will first evaluate 5 cycles, test whether stationarity has been reached or n_cycles_max has been exceeded, then simulate another 5 cycles and check again, etc.
Each communication between CADET-Core and CADET-Process takes time, so grouping multiple cycles into a batch is advantageous up to a point. However, for your use case, where n_cycles_max is very low, setting n_cycles_min lower (1 or 2) will be necessary. A minimal sketch follows. Hope that helps. Let me know if you have any further questions.
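For reference, a sketch of those settings, assuming the attribute names used in this thread are available on the CADET-Process Cadet simulator and that process is an existing Process object defined elsewhere:

from CADETProcess.simulator import Cadet

simulator = Cadet()
simulator.evaluate_stationarity = True  # stop once cyclic steady state is reached
simulator.n_cycles_min = 2              # cycles simulated per batch before each check
simulator.n_cycles_max = 10             # hard upper limit on the number of cycles
simulator.timeout = 10 * 60             # optional: the 10-minute safety net from above

simulation_results = simulator.simulate(process)  # `process` defined elsewhere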
