Conversation

@vasqu
Contributor

@vasqu vasqu commented Jan 7, 2026

As per title, the current fp8 experts implementation is wrong - likely related to #42456 (cc @3outeille)

Before:

  • The number of experts was treated as the number of top-k experts that are hit, i.e. num_experts = top_k_weights.shape[1]
    • This is simply wrong: the routing indices can cover all experts, yet num_experts is limited to top_k
    • This can trigger device-side assert errors, since one_hot may receive indices that exceed the number of classes implied by num_experts (a minimal repro is sketched right after this list)
  • The hit expert index is then used, which is only correct if we made it past the selection above
    • and only because the indices happened to be correctly limited
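
A minimal repro of that device assert (my own sketch, not code from the PR), with made-up shapes: 8 experts, top-2 routing over 2 tokens. On CPU the mismatch shows up as a plain RuntimeError; on CUDA it surfaces as a device-side assert:

    import torch

    top_k_index = torch.tensor([[0, 7], [2, 5]])  # routed expert ids can cover all 8 experts
    top_k_weights = torch.rand(2, 2)              # (num_tokens, top_k)

    num_experts = top_k_weights.shape[1]          # old code: 2 instead of 8
    try:
        torch.nn.functional.one_hot(top_k_index, num_classes=num_experts + 1)
    except RuntimeError as err:
        print("one_hot rejects index 7 >= num_classes 3:", err)

    # fixed: use the real expert count
    expert_mask = torch.nn.functional.one_hot(top_k_index, num_classes=8)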

After:

  • We now follow the current eager implementation, e.g. Mixtral:

        final_hidden_states = torch.zeros_like(hidden_states)
        with torch.no_grad():
            expert_mask = torch.nn.functional.one_hot(top_k_index, num_classes=self.num_experts)
            expert_mask = expert_mask.permute(2, 1, 0)
            expert_hit = torch.greater(expert_mask.sum(dim=(-1, -2)), 0).nonzero()
  • The portion where we iterate through experts should only have its linear calls changed relative to the reference below (a standalone sanity check of this loop is sketched right after this list):

        for expert_idx in expert_hit:
            expert_idx = expert_idx[0]
            if expert_idx == self.num_experts:
                continue
            top_k_pos, token_idx = torch.where(expert_mask[expert_idx])
            current_state = hidden_states[token_idx]
            gate, up = nn.functional.linear(current_state, self.gate_up_proj[expert_idx]).chunk(2, dim=-1)
            current_hidden_states = self.act_fn(gate) * up
            current_hidden_states = nn.functional.linear(current_hidden_states, self.down_proj[expert_idx])
            current_hidden_states = current_hidden_states * top_k_weights[token_idx, top_k_pos, None]
            final_hidden_states.index_add_(0, token_idx, current_hidden_states.to(final_hidden_states.dtype))
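
As a standalone sanity check (my own sketch, not code from the PR), the masked loop above reproduces a brute-force per-token routing reference with random weights; every name and shape below is made up for illustration:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    T, H, I, E, K = 6, 16, 32, 8, 2          # tokens, hidden, intermediate, experts, top-k
    act_fn = nn.SiLU()

    hidden_states = torch.randn(T, H)
    gate_up_proj = torch.randn(E, 2 * I, H)  # per-expert fused gate/up weights
    down_proj = torch.randn(E, H, I)         # per-expert down projection

    router_probs = torch.softmax(torch.randn(T, E), dim=-1)
    top_k_weights, top_k_index = torch.topk(router_probs, K, dim=-1)

    # Masked expert loop, same structure as the eager reference above.
    final_hidden_states = torch.zeros_like(hidden_states)
    with torch.no_grad():
        expert_mask = nn.functional.one_hot(top_k_index, num_classes=E).permute(2, 1, 0)
        expert_hit = torch.greater(expert_mask.sum(dim=(-1, -2)), 0).nonzero()
    for expert_idx in expert_hit:
        expert_idx = expert_idx[0]
        top_k_pos, token_idx = torch.where(expert_mask[expert_idx])
        current_state = hidden_states[token_idx]
        gate, up = nn.functional.linear(current_state, gate_up_proj[expert_idx]).chunk(2, dim=-1)
        current = nn.functional.linear(act_fn(gate) * up, down_proj[expert_idx])
        current = current * top_k_weights[token_idx, top_k_pos, None]
        final_hidden_states.index_add_(0, token_idx, current)

    # Brute-force reference: loop over every token and its selected experts directly.
    reference = torch.zeros_like(hidden_states)
    for t in range(T):
        for k in range(K):
            e = top_k_index[t, k]
            gate, up = nn.functional.linear(hidden_states[t], gate_up_proj[e]).chunk(2, dim=-1)
            reference[t] += top_k_weights[t, k] * nn.functional.linear(act_fn(gate) * up, down_proj[e])

    assert torch.allclose(final_hidden_states, reference, atol=1e-4)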

I'm not sure how long it has been this way, but it is definitely not correct atm. I'd be curious whether I'm overlooking something here, but without this fix #42028 will not work for the original model - my suspicion is that Mixtral's routing is biased towards low expert indices, which may have masked this oversight so far.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vasqu vasqu mentioned this pull request Jan 7, 2026
Collaborator

@ArthurZucker ArthurZucker left a comment


Yes, absolutely! I did not check, but if a test is missing let's add one!

@ArthurZucker
Collaborator

@3outeille note that this breaks TP, but it is already broken as it stands anyway

Member

@SunMarc SunMarc left a comment


LGTM

Comment on lines 535 to 550
         top_k_weights: torch.Tensor,
     ) -> torch.Tensor:
         final_hidden_states = torch.zeros_like(hidden_states)
-        num_experts = top_k_weights.shape[1]
         with torch.no_grad():
-            expert_mask = torch.nn.functional.one_hot(top_k_index, num_classes=num_experts + 1)
+            expert_mask = torch.nn.functional.one_hot(top_k_index, num_classes=self.num_experts)
             expert_mask = expert_mask.permute(2, 1, 0)
             expert_hit = torch.greater(expert_mask.sum(dim=(-1, -2)), 0).nonzero()

         for expert_idx in expert_hit:
             expert_idx = expert_idx[0]
-            if expert_idx == num_experts:
+            if expert_idx == self.num_experts:
                 continue
-            _, token_idx = torch.where(expert_mask[expert_idx])
-            current_state = hidden_states.index_select(0, token_idx)
+            top_k_pos, token_idx = torch.where(expert_mask[expert_idx])
+            current_state = hidden_states[token_idx]
             gate, up = self.linear(
                 current_state, self.gate_up_proj[expert_idx], self.gate_up_proj_scale_inv[expert_idx]
Member


Thanks! Maybe we can add a small comment saying that this was mostly copied from the deepspeed_v3 modeling code, so that we remember to also propagate changes here in the future.

Contributor Author


Added a comment, referencing mixtral

@github-actions
Contributor

github-actions bot commented Jan 8, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: finegrained_fp8

Comment on lines +379 to +380
@unittest.skip(reason="Dependent on #42028, will be removed alongside that PR")
def test_quantized_moe_forward(self):
Contributor Author


This acts as a sanity integration check, but it depends on the MiniMax M2 PR (#42028), so I will remove the skip when merging that PR.

I think this is the easiest way, as those weights force the issue to surface.

@github-actions
Contributor

github-actions bot commented Jan 8, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43154&sha=d0d860

@vasqu vasqu merged commit 9255982 into huggingface:main Jan 8, 2026
25 checks passed
@vasqu vasqu deleted the fix-fp8-experts branch January 8, 2026 16:37