Restoring PACG on older ARM64 CPU hangs #2824

@hanwen-flow

I see occasional failures when restoring a set of processes in Podman.

The symptom is a timeout. Debugging with gdb shows that criu is hanging here:

(gdb) bt
#0  0x0000ffff95607be4 in syscall () from /lib/aarch64-linux-gnu/libc.so.6
#1  0x0000aaaae8a1fc64 in sys_futex (addr2=0x0, val3=0, timeout=0xffffc4786258, val1=<optimized out>, op=0, 
    addr1=0xffff9595b00c) at include/common/lock.h:29
#2  __restore_wait_inprogress_tasks (participants=participants@entry=0) at criu/cr-restore.c:182
#3  0x0000aaaae8a21078 in restore_wait_inprogress_tasks () at criu/cr-restore.c:194
#4  restore_switch_stage (next_stage=5) at criu/cr-restore.c:224
#5  restore_root_task (init=<optimized out>) at criu/cr-restore.c:2213
#6  0x0000aaaae8a220fc in cr_restore_tasks () at criu/cr-restore.c:2417
#7  0x0000aaaae8a27554 in restore_using_req (req=<optimized out>, sk=3) at criu/cr-service.c:889
#8  cr_service_work (sk=3) at criu/cr-service.c:1365
#9  0x0000aaaae89f5f3c in main (argc=3, argv=0xffffc4786758, envp=<optimized out>) at criu/crtools.c:191
(gdb) up
#1  0x0000aaaae8a1fc64 in sys_futex (addr2=0x0, val3=0, timeout=0xffffc4786258, val1=<optimized out>, op=0, 
    addr1=0xffff9595b00c) at include/common/lock.h:29
29	include/common/lock.h: No such file or directory.
(gdb) 
#2  __restore_wait_inprogress_tasks (participants=participants@entry=0) at criu/cr-restore.c:182
182	criu/cr-restore.c: No such file or directory.
(gdb) p task_entries->nr_in_progress
Cannot access memory at address 0xaaaae8b5d1b0
(gdb) p &task_entries->nr_in_progress
Cannot access memory at address 0xaaaae8b5d1b0

The last lines in restore.log are:

(05.342893) pie: 134: restoring lsm profile (current) changeprofile containers-default-engflow
(05.343043) pie: 132: seccomp: Restored mode 2 on tid 132
(05.343086) pie: 132: restoring lsm profile (current) changeprofile containers-default-engflow

(I changed the profile name from its default.)

This happens occasionally on AWS ARM64 machines. We run several machine types; the machine with the hang above was a c6gd.2xlarge, with this /proc/cpuinfo:

processor	: 0
BogoMIPS	: 243.75
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1

The problem is machine-specific: the exact same snapshot restores correctly on a different machine, but the hang reproduces reliably on the affected machine.
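
A minimal sketch for comparing the CPU feature flags of the checkpoint host and the restore host. It assumes that the PACG in the title refers to the `pacg` (pointer authentication, generic key) flag that Linux reports in /proc/cpuinfo on ARM64; that flag does not appear in the Features line above. The function names and the `HYPOTHETICAL_SOURCE` value are illustrative only: I have not included the cpuinfo of the machine where the snapshot was taken, so that line is made up for the example.

```python
def cpu_features(cpuinfo_text):
    """Return the set of flags from the first 'Features' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


def missing_features(source_cpuinfo, target_cpuinfo):
    """Flags present on the checkpoint (source) host but absent on the restore (target) host."""
    return cpu_features(source_cpuinfo) - cpu_features(target_cpuinfo)


# Features line of the affected c6gd.2xlarge, copied from the report above.
AFFECTED = (
    "Features\t: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp "
    "asimdhp cpuid asimdrdm lrcpc dcpop asimddp"
)

# Hypothetical source host: the same flags plus pointer authentication.
# This is an assumption for illustration, not data from a real machine.
HYPOTHETICAL_SOURCE = AFFECTED + " paca pacg"

print(sorted(missing_features(HYPOTHETICAL_SOURCE, AFFECTED)))  # prints ['paca', 'pacg']
```

On real hosts, the two arguments would be the contents of /proc/cpuinfo read on each machine. A non-empty result would only point to a feature mismatch as a candidate cause; it does not by itself explain the futex wait in the backtrace.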

I am using a locally modified version of

commit c61329b30387aa50634e794a4781dde64cb2a6c3
Author: Radostin Stoyanov
Date:   Sun May 11 11:33:29 2025 +0100

    seize: fix pause devices for frozen containers

(The modification is a minor tweak to symlink the lazy pages socket and should be unrelated to this hang.) The same version has been working reliably on x64.
