Restoring PACG on older ARM64 CPU hangs

I see occasional failures restoring a set of processes in podman.

The symptom is a timeout. Some debugging shows that it is hanging here:

```
(gdb) bt
#0  0x0000ffff95607be4 in syscall () from /lib/aarch64-linux-gnu/libc.so.6
#1  0x0000aaaae8a1fc64 in sys_futex (addr2=0x0, val3=0, timeout=0xffffc4786258, val1=<optimized out>, op=0, 
    addr1=0xffff9595b00c) at include/common/lock.h:29
#2  __restore_wait_inprogress_tasks (participants=participants@entry=0) at criu/cr-restore.c:182
#3  0x0000aaaae8a21078 in restore_wait_inprogress_tasks () at criu/cr-restore.c:194
#4  restore_switch_stage (next_stage=5) at criu/cr-restore.c:224
#5  restore_root_task (init=<optimized out>) at criu/cr-restore.c:2213
#6  0x0000aaaae8a220fc in cr_restore_tasks () at criu/cr-restore.c:2417
#7  0x0000aaaae8a27554 in restore_using_req (req=<optimized out>, sk=3) at criu/cr-service.c:889
#8  cr_service_work (sk=3) at criu/cr-service.c:1365
#9  0x0000aaaae89f5f3c in main (argc=3, argv=0xffffc4786758, envp=<optimized out>) at criu/crtools.c:191
(gdb) up
#1  0x0000aaaae8a1fc64 in sys_futex (addr2=0x0, val3=0, timeout=0xffffc4786258, val1=<optimized out>, op=0, 
    addr1=0xffff9595b00c) at include/common/lock.h:29
29	include/common/lock.h: No such file or directory.
(gdb) 
#2  __restore_wait_inprogress_tasks (participants=participants@entry=0) at criu/cr-restore.c:182
182	criu/cr-restore.c: No such file or directory.
(gdb) p task_entries->nr_in_progress
Cannot access memory at address 0xaaaae8b5d1b0
(gdb) p &task_entries->nr_in_progress
Cannot access memory at address 0xaaaae8b5d1b0
```

The last lines in restore.log are:

```
(05.342893) pie: 134: restoring lsm profile (current) changeprofile containers-default-engflow
(05.343043) pie: 132: seccomp: Restored mode 2 on tid 132
(05.343086) pie: 132: restoring lsm profile (current) changeprofile containers-default-engflow
```

(I changed the profile name from its default.)

This happens occasionally on AWS ARM64 machines. We run several machine types; the machine that showed the hang above was a c6gd.2xlarge, with this cpuinfo:

```
processor	: 0
BogoMIPS	: 243.75
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1
```
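The Features line above lists neither `paca` nor `pacg`, the flags an arm64 kernel reports in /proc/cpuinfo when pointer authentication is available. Assuming the snapshot was taken on a host that does advertise them (which is what the title suggests), a quick check like the following could be run on each machine before attempting a restore. This is only a sketch; the helper names `cpu_features` and `missing_pointer_auth` are made up for illustration and are not part of CRIU.

```python
def cpu_features(cpuinfo_text):
    """Return the set of flags on the first 'Features' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    # No 'Features' line (e.g. not an arm64 host): report no flags.
    return set()


def missing_pointer_auth(features):
    """Return which arm64 pointer authentication flags are absent, sorted."""
    return sorted({"paca", "pacg"} - features)


# Features line copied from the cpuinfo output of the affected machine.
SAMPLE = (
    "Features\t: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp "
    "asimdhp cpuid asimdrdm lrcpc dcpop asimddp\n"
)

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        # Fall back to the sample when /proc/cpuinfo is not readable.
        text = SAMPLE
    print("missing pointer auth flags:", missing_pointer_auth(cpu_features(text)))
```

Running it on both the machine where the restore works and the one where it hangs would show whether the two differ in these flags.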

The problem is machine-specific: the exact same snapshot restores correctly on a different machine, but on the affected machine the hang reproduces.

I am using a locally modified version of

```
commit c61329b30387aa50634e794a4781dde64cb2a6c3
Author: Radostin Stoyanov <rstoyanov@fedoraproject.org>
Date:   Sun May 11 11:33:29 2025 +0100

    seize: fix pause devices for frozen containers
```
(The modification is a minor tweak to symlink the lazy pages socket and should be unrelated.) The same version has been working reliably on x64.

Restoring PACG on older ARM64 CPU hangs #2824
