Streamlining of Scaled Dot-Product Attention #12

Merged
iksnagreb merged 20 commits into dev from feature/attention-streamline on Feb 6, 2025
@iksnagreb iksnagreb commented Jan 20, 2025

iksnagreb added 17 commits April 3, 2024 15:12
Flips the order of AbsorbSignBiasIntoMultiThreshold and
MoveScalarLinearPastInvariants streamlining transforms to prefer
absorbing adds into multi-thresholds instead of propagating them
downwards. This should prevent accumulation of scalar adds in front of
two-input matmuls in scaled dot-product attention operators (they cannot
be moved past the matmul operation in that case).
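To illustrate why a scalar add gets stuck in front of a two-input matmul, here is a minimal pure-Python sketch. The helper names (`matmul`, `add_scalar`) are made up for this illustration and are not part of FINN or QONNX. It shows that `(A + c) @ B` differs from `A @ B` by `c` times the column sums of `B`, which depends on the second, dynamic input and is therefore not a constant scalar add.

```python
def matmul(a, b):
    """Naive matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scalar(a, c):
    """Add the scalar c to every element of matrix a."""
    return [[x + c for x in row] for row in a]


# Two dynamic inputs, as in the Q @ K^T product of attention
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
c = 10  # scalar add sitting in front of the first matmul input

with_add = matmul(add_scalar(A, c), B)
without_add = matmul(A, B)

# What would have to be added after the matmul to compensate
residual = [
    [w - p for w, p in zip(w_row, p_row)]
    for w_row, p_row in zip(with_add, without_add)
]

# The residual is c times the column sums of B: it varies per column
# and changes whenever the dynamic input B changes.
col_sums = [sum(col) for col in zip(*B)]
expected_residual = [[c * s for s in col_sums] for _ in A]
```

Because the residual depends on `B`, absorbing the add into a preceding multi-threshold is the only way to get rid of it in this situation.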
The MoveScalarMulPastMatMul transformation can now handle matmul
operations with both inputs preceded by a scalar multiplication.

This change is required for streamlining scaled dot-product attention
operations, which are essentially two-input matmuls.
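The identity behind this change can be checked with a small pure-Python sketch. The helpers `matmul` and `scale` are hypothetical names for this illustration only. Scalar multiplications on both matmul inputs can be merged into a single scalar multiplication after the matmul, since `(a * A) @ (b * B) == (a * b) * (A @ B)`.

```python
def matmul(a, b):
    """Naive matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scale(m, s):
    """Multiply every element of matrix m by the scalar s."""
    return [[s * x for x in row] for row in m]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
a, b = 2, 3  # scalar multiplications in front of each matmul input

# Scalars applied before the matmul, one per input
before = matmul(scale(A, a), scale(B, b))

# Both scalars merged and applied once after the matmul
after = scale(matmul(A, B), a * b)
```

Integers are used here so that the comparison is exact; with floating-point scales the two sides would agree only up to rounding.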
Assertions are too restrictive, causing the program to terminate in cases
where the streamlining simply encounters nodes to which the transforms are
not applicable: Just skip those nodes.

Only the two transforms currently affecting the streamlining of scaled
dot-product attention have been changed.
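The difference between asserting and skipping can be sketched as follows. This is a toy model, assuming nodes are plain dictionaries with hypothetical `op_type` and `scalar` fields; it does not reproduce the actual FINN transform code.

```python
def find_applicable(nodes):
    """Return the names of nodes a hypothetical scalar-Mul transform would rewrite.

    Nodes that do not match are skipped instead of triggering an assertion,
    so one non-matching node no longer aborts the whole streamlining pass.
    """
    applicable = []
    for node in nodes:
        # Not a Mul at all: nothing to do here, move on
        if node["op_type"] != "Mul":
            continue
        # A Mul without a scalar parameter: the transform does not apply
        if node.get("scalar") is None:
            continue
        applicable.append(node["name"])
    return applicable


graph = [
    {"name": "mul0", "op_type": "Mul", "scalar": 0.5},
    {"name": "softmax0", "op_type": "Softmax"},
    {"name": "mul1", "op_type": "Mul", "scalar": None},
    {"name": "mul2", "op_type": "Mul", "scalar": 2.0},
]

result = find_applicable(graph)
```

With an `assert` in place of either `continue`, the pass would stop at `softmax0` or `mul1` even though `mul2` is perfectly valid.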
This is pretty much a copy and paste of the existing test case, just
replacing the MatMul initializer by a second top-input followed by a
scalar Mul.
Folding quantized initializers into add-like nodes did not respect the
order of inputs to the add node correctly. This is fixed by testing for
one of the two possible orders and selecting the following indices
accordingly.

Shape inference following the transformation is fixed by deleting the
annotations instead of propagating them incorrectly. Deleting the shape
annotations should not hurt, as these are redone by running shape
inference after each transformation anyway.
Add is commutative and thus the export does not always generate the
initializer as the second input. However, this was always assumed by
this transformation, failing via assertion if the inputs were simply
ordered differently. The transformation now handles both of the two
possible input orderings.
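Handling both input orderings could look roughly like the sketch below. It is a simplified stand-in, assuming the set of initializer tensor names is available as a plain Python set; the function name `split_add_inputs` is hypothetical and not taken from the FINN sources.

```python
def split_add_inputs(inputs, initializers):
    """Return (dynamic_input, initializer_input) for a two-input Add.

    Works for either input order. Returns None if the node does not have
    exactly one initializer input, so the caller can skip it.
    """
    first, second = inputs
    if second in initializers and first not in initializers:
        return first, second
    if first in initializers and second not in initializers:
        return second, first
    # Both dynamic or both constant: the transformation is not applicable
    return None


initializers = {"bias"}

# The export may put the initializer in either position
order_a = split_add_inputs(["x", "bias"], initializers)
order_b = split_add_inputs(["bias", "x"], initializers)
both_dynamic = split_add_inputs(["x", "y"], initializers)
```

Returning `None` instead of asserting keeps the behavior consistent with skipping non-applicable nodes.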
This is required for streamlining packed input projections of multi-head
scaled dot-product attention. Adds support for Squeeze and Unsqueeze as
well. Skip moving of fork-node producers as this is not handled
correctly. However, the same effect can be attained by applying the
MoveLinearPastFork transformation first.
Explicitly rejects absorbing into fork-nodes. Previously, this probably
would have failed, silently resulting in a wrong model. Not sure whether
this happened in any practically relevant models?
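A fork check of the kind described here can be sketched in a few lines. This is an illustrative model with nodes as dictionaries of input and output tensor names; `consumers` and `is_fork` are hypothetical helpers, not the functions used in the actual transformation.

```python
def consumers(nodes, tensor):
    """Names of all nodes that read the given tensor."""
    return [n["name"] for n in nodes if tensor in n["inputs"]]


def is_fork(nodes, node):
    """A node is a fork if any of its outputs feeds more than one consumer."""
    return any(len(consumers(nodes, out)) > 1 for out in node["outputs"])


graph = [
    {"name": "proj", "inputs": ["x"], "outputs": ["qkv"]},
    {"name": "split_q", "inputs": ["qkv"], "outputs": ["q"]},
    {"name": "split_k", "inputs": ["qkv"], "outputs": ["k"]},
    {"name": "thresh", "inputs": ["q"], "outputs": ["q_quant"]},
]

# Absorbing an operation into "proj" would also change what "split_k"
# sees, so a transform should reject it; "split_q" has a single consumer.
proj_is_fork = is_fork(graph, graph[0])
split_q_is_fork = is_fork(graph, graph[1])
```

Absorbing a following operation into a fork node changes the tensor seen by every branch, which is how a silently wrong model could arise.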
This probably is still rather sketchy, but at least it tries to check
the data layout annotation. For now, this seems to be enough for getting
the thresholds of multi-head attention right, IF qonnx properly annotates
the 3D layouts.
@iksnagreb iksnagreb self-assigned this Jan 28, 2025
@iksnagreb iksnagreb requested a review from fpjentzsch January 28, 2025 20:11
@iksnagreb iksnagreb marked this pull request as ready for review January 28, 2025 20:13
@fpjentzsch

@iksnagreb Could you fix the conflicts (especially the one in qonnx_activation_handlers.py) by merging dev into this?

@iksnagreb

Author

Conflicts should be resolved now

@iksnagreb iksnagreb merged commit 1e3085f into dev Feb 6, 2025
LinusJungemann added a commit that referenced this pull request Jun 24, 2025