Hey,
I am using mealpy for my PCC optimization, and the optimization keeps stalling. Mealpy handles the parallelization, and I think that might be where things are going wrong. I talked to our high-performance computing cluster support people, and here is what they found:
I’ve been watching on r04n00 and your Python script starts out running as expected:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
65370 dhaught 20 0 3194440 414208 84220 S 113.7 0.2 22:37.39 python
It was using barely over 100% of a CPU core, which upon investigation was a single computational thread plus additional threads (garbage collection, etc.) that were spinning while waiting to be woken up by the interpreter. Eventually, 32 cadet-cli processes appeared along with 32 multiprocessing threads:
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x2b38a58b98c0 (LWP 65370) “python” 0x00002b38a5ab7adb in do_futex_wait.constprop ()
from /lib64/libpthread.so.0
2 Thread 0x2b38c1600700 (LWP 65373) “jemalloc_bg_thd” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
3 Thread 0x2b38c293e700 (LWP 65374) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
4 Thread 0x2b38c2b3f700 (LWP 65375) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
5 Thread 0x2b38c2d40700 (LWP 65376) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
6 Thread 0x2b38c2f41700 (LWP 65377) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
7 Thread 0x2b38c3142700 (LWP 65378) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
8 Thread 0x2b38c3343700 (LWP 65379) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
9 Thread 0x2b38c3544700 (LWP 65380) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
10 Thread 0x2b38c3745700 (LWP 65381) “python” 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
11 Thread 0x2b38dbd87700 (LWP 65651) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
12 Thread 0x2b38dbf88700 (LWP 65652) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
13 Thread 0x2b38e4200700 (LWP 65653) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
14 Thread 0x2b38e4401700 (LWP 65654) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
15 Thread 0x2b38e4602700 (LWP 65655) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
16 Thread 0x2b38e4803700 (LWP 65656) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
17 Thread 0x2b38e4a04700 (LWP 65657) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
18 Thread 0x2b38e4c05700 (LWP 65658) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
19 Thread 0x2b38e4e06700 (LWP 65659) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
20 Thread 0x2b38e5007700 (LWP 65660) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
21 Thread 0x2b38e5208700 (LWP 65661) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
22 Thread 0x2b38e5409700 (LWP 65677) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
23 Thread 0x2b38e560a700 (LWP 65678) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
24 Thread 0x2b38e580b700 (LWP 65679) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
25 Thread 0x2b38e5a0c700 (LWP 65681) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
26 Thread 0x2b38e5c0d700 (LWP 65682) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
27 Thread 0x2b38e5e0e700 (LWP 65683) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
28 Thread 0x2b38e600f700 (LWP 65684) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
29 Thread 0x2b38e6210700 (LWP 65685) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
30 Thread 0x2b38e6411700 (LWP 65686) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
31 Thread 0x2b38e6612700 (LWP 65687) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
32 Thread 0x2b38e6813700 (LWP 65688) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
33 Thread 0x2b38e6a14700 (LWP 65689) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
34 Thread 0x2b38e6c15700 (LWP 65690) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
35 Thread 0x2b38e6e16700 (LWP 65692) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
36 Thread 0x2b38e7017700 (LWP 65693) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
37 Thread 0x2b38e7218700 (LWP 65694) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
38 Thread 0x2b38e7419700 (LWP 65695) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
39 Thread 0x2b38e761a700 (LWP 65697) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
40 Thread 0x2b38e781b700 (LWP 65698) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
41 Thread 0x2b38e7a1c700 (LWP 65699) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
42 Thread 0x2b38e7c21700 (LWP 65700) “python” 0x00002b38a64c238d in poll () from /lib64/libc.so.6
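As a cross-check on the listing above, the threads can be tallied by the function each one is blocked in. This is only a minimal sketch: `tally_frames` and `FRAME_RE` are hypothetical names, the sample text is a four-line excerpt of the listing with straight quotes, and the regular expression assumes the frame column always has the form `in <function> (`. Run on the full listing, it should report one thread in `do_futex_wait.constprop`, nine in `pthread_cond_wait`, and 32 in `poll`, which is consistent with the 32 multiprocessing threads mentioned above.

```python
import re
from collections import Counter

# Captures the function name that follows " in " in the frame column of
# gdb's "info threads" output. Symbol version suffixes such as
# "@@GLIBC_2.3.2" are not part of the captured name.
FRAME_RE = re.compile(r"\bin\s+([A-Za-z_][\w.]*)")


def tally_frames(listing):
    """Count threads per blocking function in an 'info threads' listing."""
    counts = Counter()
    for line in listing.splitlines():
        match = FRAME_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Short excerpt of the listing, used only for illustration.
sample = """\
1 Thread 0x2b38a58b98c0 (LWP 65370) "python" 0x00002b38a5ab7adb in do_futex_wait.constprop ()
from /lib64/libpthread.so.0
3 Thread 0x2b38c293e700 (LWP 65374) "python" 0x00002b38a5ab5965 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
11 Thread 0x2b38dbd87700 (LWP 65651) "python" 0x00002b38a64c238d in poll () from /lib64/libc.so.6
12 Thread 0x2b38dbf88700 (LWP 65652) "python" 0x00002b38a64c238d in poll () from /lib64/libc.so.6
"""

if __name__ == "__main__":
    for name, count in tally_frames(sample).most_common():
        print(f"{count:3d}  {name}")
```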
UID PID PPID C STIME TTY TIME CMD
dhaught 65311 65301 0 13:21 ? 00:00:00 /bin/bash -l /var/spool/slurm/job29471921/slurm_script
dhaught 65370 65311 98 13:21 ? 00:27:33 python Fit_LSHADE_PCC_V2.py
dhaught 67802 65370 99 13:49 ? 00:00:18 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 67836 65370 99 13:49 ? 00:00:17 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 67868 65370 99 13:49 ? 00:00:17 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 67900 65370 99 13:49 ? 00:00:16 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 67932 65370 89 13:49 ? 00:00:13 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 67933 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 67996 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 67998 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 67999 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68031 65370 78 13:49 ? 00:00:11 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68051 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68064 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68065 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68066 65370 90 13:49 ? 00:00:13 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68084 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68192 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68193 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68243 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68258 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68290 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68343 65370 99 13:49 ? 00:00:14 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68354 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68355 65370 99 13:49 ? 00:00:14 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68356 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68388 65370 99 13:49 ? 00:00:14 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68389 65370 99 13:49 ? 00:00:14 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68421 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68500 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68501 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68502 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68549 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 68574 65370 99 13:49 ? 00:00:15 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 65371 65311 0 13:21 ? 00:00:00 tee output.log
Additional cadet-cli processes (totaling the number of data points) were started and ended until all cadet-cli processes were completed and it was back to just the Python process, right up until a single cadet-cli was started:
UID PID PPID C STIME TTY TIME CMD
dhaught 65311 65301 0 13:21 ? 00:00:00 /bin/bash -l /var/spool/slurm/job29471921/slurm_script
dhaught 65370 65311 92 13:21 ? 00:34:40 python Fit_LSHADE_PCC_V2.py
dhaught 65371 65311 0 13:21 ? 00:00:00 tee output.log
UID PID PPID C STIME TTY TIME CMD
dhaught 65311 65301 0 13:21 ? 00:00:00 /bin/bash -l /var/spool/slurm/job29471921/slurm_script
dhaught 65370 65311 92 13:21 ? 00:34:42 python Fit_LSHADE_PCC_V2.py
dhaught 71072 65370 0 13:59 ? 00:00:04 /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/cadet-cli /wor
dhaught 65371 65311 0 13:21 ? 00:00:00 tee output.log
On this pass, there were only ever 7 cadet-cli programs running and there are at most 7 HDF5 files in ./tmp at all times. Eventually this winds down to just a single cadet-cli program, and that one has been running since ca. 15:30 yesterday. I attached to it with gdb and set a few breakpoints to isolate what it’s doing:
(gdb) bt
#0 0x00002af4f61fbd25 in IDAStep () from /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/…/lib/libcadet.so.0
#1 0x00002af4f6201e1d in IDASolve () from /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/…/lib/libcadet.so.0
#2 0x00002af4f5d1647b in cadet::Simulator::integrate() ()
from /home/3563/.conda/cadet/11_12_2024_CadetP_Core5/bin/…/lib/libcadet.so.0
#3 0x00005586e34a61af in main ()
The program keeps entering and exiting the [IDASolve(), IDAStep(), …] sequence of functions but has been inside cadet::Simulator::integrate() the whole time. I would have to surmise that there are some numerical stability or convergence issues that are preventing the function from reaching a conclusion. If you can enable greater verbosity in the execution of the cadet-cli program, that might help.
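If a single non-converging simulation is what stalls the run, one possible workaround is to put a wall-clock limit on each external solver call and return a penalty value when the limit is exceeded. The following is a minimal sketch under that assumption, using only the Python standard library. `run_with_timeout`, `guarded_objective`, and `PENALTY` are hypothetical names, `cmd` stands in for whatever command line launches the simulation, and how this would be wired into CADET-Process or mealpy is not shown.

```python
import subprocess
import sys

# Hypothetical "very bad fitness" value handed back to the optimizer when a
# simulation fails or runs out of time. The right value depends on the
# objective being minimized.
PENALTY = 1e10


def run_with_timeout(cmd, timeout_s):
    """Run an external command and report whether it finished cleanly.

    Returns True only if the command exited with return code 0 within
    timeout_s seconds. subprocess.run kills the child process when the
    timeout expires, so a stuck solver does not linger.
    """
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def guarded_objective(cmd, timeout_s, score_fn):
    """Return score_fn() if the command succeeds in time, else PENALTY."""
    if not run_with_timeout(cmd, timeout_s):
        return PENALTY
    return score_fn()


if __name__ == "__main__":
    # Stand-in for a solver call that never converges: sleeps for 30 s.
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(guarded_objective(stuck, timeout_s=1.0, score_fn=lambda: 0.0))
```

With a guard like this, a parameter set that sends the integrator into an endless loop would count as a failed evaluation instead of blocking the remaining workers. The timeout has to be chosen well above the runtime of a normal simulation so that valid candidates are not penalized.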
Does anyone know what is going on here?
I am using this version of CADET-Process: GitHub - angela-moser/CADET-Process: A Framework for Modelling and Optimizing Advanced Chromatographic Processes
I am also using CADET-Core 5.0.
Here are the files from the HPC run:
Caviness_PCC_Opti_V10.zip (1008.3 KB)