Description
I see occasional failures restoring a set of processes in podman.
The symptom is a timeout. Some debugging shows that it is hanging here:
(gdb) bt
#0 0x0000ffff95607be4 in syscall () from /lib/aarch64-linux-gnu/libc.so.6
#1 0x0000aaaae8a1fc64 in sys_futex (addr2=0x0, val3=0, timeout=0xffffc4786258, val1=<optimized out>, op=0,
addr1=0xffff9595b00c) at include/common/lock.h:29
#2 __restore_wait_inprogress_tasks (participants=participants@entry=0) at criu/cr-restore.c:182
#3 0x0000aaaae8a21078 in restore_wait_inprogress_tasks () at criu/cr-restore.c:194
#4 restore_switch_stage (next_stage=5) at criu/cr-restore.c:224
#5 restore_root_task (init=<optimized out>) at criu/cr-restore.c:2213
#6 0x0000aaaae8a220fc in cr_restore_tasks () at criu/cr-restore.c:2417
#7 0x0000aaaae8a27554 in restore_using_req (req=<optimized out>, sk=3) at criu/cr-service.c:889
#8 cr_service_work (sk=3) at criu/cr-service.c:1365
#9 0x0000aaaae89f5f3c in main (argc=3, argv=0xffffc4786758, envp=<optimized out>) at criu/crtools.c:191
(gdb) up
#1 0x0000aaaae8a1fc64 in sys_futex (addr2=0x0, val3=0, timeout=0xffffc4786258, val1=<optimized out>, op=0,
addr1=0xffff9595b00c) at include/common/lock.h:29
29 include/common/lock.h: No such file or directory.
(gdb)
#2 __restore_wait_inprogress_tasks (participants=participants@entry=0) at criu/cr-restore.c:182
182 criu/cr-restore.c: No such file or directory.
(gdb) p task_entries->nr_in_progress
Cannot access memory at address 0xaaaae8b5d1b0
(gdb) p &task_entries->nr_in_progress
Cannot access memory at address 0xaaaae8b5d1b0
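For readers unfamiliar with the pattern, frame #1 is a futex wait with `op=0` (which is `FUTEX_WAIT`) and a non-NULL timeout, called from a loop that appears to wait for a shared in-progress counter to drop. The following is only a minimal sketch of that kind of wait loop, not CRIU's actual code: `wait_for_zero`, `futex_wait`, the 10 ms timeout, and the standalone `nr_in_progress` variable are all hypothetical names and values chosen for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for a shared "tasks still in progress" counter. */
static _Atomic uint32_t nr_in_progress;

/* Thin wrapper around the raw futex syscall; FUTEX_WAIT sleeps only if
 * *addr still equals `expected`, and wakes on FUTEX_WAKE or timeout. */
static long futex_wait(_Atomic uint32_t *addr, uint32_t expected,
                       const struct timespec *timeout)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, timeout, NULL, 0);
}

/* Wait until *counter reaches zero. Returns 0 on success, or -1 if the
 * counter is still non-zero after max_rounds wait attempts. */
static int wait_for_zero(_Atomic uint32_t *counter, int max_rounds)
{
    struct timespec to = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

    for (int i = 0; i < max_rounds; i++) {
        uint32_t v = atomic_load(counter);
        if (v == 0)
            return 0;
        /* Sleeps up to 10 ms unless the value changes or a wake arrives. */
        futex_wait(counter, v, &to);
    }
    return atomic_load(counter) == 0 ? 0 : -1;
}
```

In a loop of this shape, if a participant never decrements the counter (or never issues the matching wake), the waiter keeps re-entering the futex call, which would be consistent with the backtrace above.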
The last lines in restore.log are:
(05.342893) pie: 134: restoring lsm profile (current) changeprofile containers-default-engflow
(05.343043) pie: 132: seccomp: Restored mode 2 on tid 132
(05.343086) pie: 132: restoring lsm profile (current) changeprofile containers-default-engflow
(I changed the profile name from its default.)
This happens occasionally on AWS ARM64 machines. We run several machine types; the machine that produced the above hang was a c6gd.2xlarge. Its /proc/cpuinfo:
processor : 0
BogoMIPS : 243.75
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x3
CPU part : 0xd0c
CPU revision : 1
The problem is machine-specific: the exact same snapshot restores correctly on a different machine, but on the affected machine the hang reproduces.
I am using a locally modified version of
commit c61329b30387aa50634e794a4781dde64cb2a6c3
Author: Radostin Stoyanov <[email protected]>
Date: Sun May 11 11:33:29 2025 +0100
seize: fix pause devices for frozen containers
(The modification is a minor tweak to symlink the lazy pages socket and should be unrelated to this hang.) The same version has been working reliably on x64.