Description
At Google we are using CRIU to accelerate certain workloads by snapshotting a process and then restoring it millions of times on a fleet of heterogeneous x86 hardware. (This is a tiny subset of the Google fleet but it is also large enough that we cannot afford to be picky about particular CPU configurations.)
We are currently using CRIU 3.17.
We recently hit our first CRIU dump failure as more and more Sapphire Rapids CPUs have been added to that workload's fleet. Specifically, Sapphire Rapids appears to need an XSAVE area larger than CRIU's dump can handle (11008 bytes vs. 4096 bytes).
The relevant CRIU logs are:
Warn (compel/arch/x86/src/lib/cpu.c:136): cpu: fpu: max xsave frame exceed xsave_struct (11008 4096)
Error (compel/arch/x86/src/lib/infect.c:424): Can't set FPU registers for 306: Bad address
Error (compel/src/lib/infect.c:1472): Parasite exited with -1
Error (criu/parasite-syscall.c:213): Can't init thread in parasite 306
Error (criu/cr-dump.c:947): Can't dump thread for pid 306
Error (criu/cr-dump.c:1744): Can't dump threads
Error (criu/cr-dump.c:2108): Dumping FAILED.
From my own investigation, I've determined that the size of xsave_struct comes from the Linux kernel and is fixed at 4096 bytes.
I came across a NetBSD patch whose output indicates that Sapphire Rapids CPUs may have a maximum XSAVE area size of 11008 bytes:
cpu0: xsave features 0x602e7<x87,SSE,AVX,Opmask,ZMM_Hi256,Hi16_ZMM,PKRU>
cpu0: xsave instructions 0x1f<XSAVEOPT,XSAVEC,XGETBV,XSAVES,XFD>
cpu0: xsave area size: current 2688, maximum 11008, xgetbv enabled
cpu0: enabled xsave 0xe7<x87,SSE,AVX,Opmask,ZMM_Hi256,Hi16_ZMM>
Unfortunately, we have no easy way to force our CRIU workload onto a particular CPU architecture in our fleet, either while dumping or while restoring, so reproducing this manually with instrumentation is time-consuming, although it is possible by brute force. So if you need us to run additional commands to dump information about these problematic host machines, we can, though it might take a while.