Open
56 commits
622928c
basic fa2
futz12 Apr 24, 2026
ad7e388
basic fp32
futz12 Apr 25, 2026
9282279
avx512 tail opt
futz12 Apr 25, 2026
5abaa84
gemv for decode
futz12 Apr 25, 2026
42d331d
permute loop
futz12 Apr 25, 2026
85c17e6
opt softmax
futz12 Apr 25, 2026
2269ed6
opt
futz12 Apr 25, 2026
7c29bda
disable bf16s
futz12 Apr 25, 2026
d12e346
fix multithread
futz12 Apr 25, 2026
b39eb23
fix cache
futz12 Apr 25, 2026
faec20b
opt cache
futz12 Apr 26, 2026
dc2de4b
Opt layout
futz12 Apr 26, 2026
9bfd732
opt for cache
futz12 Apr 26, 2026
22b6dce
opt for large dim
futz12 Apr 26, 2026
5de27cb
import dim 512 spec
futz12 Apr 26, 2026
61b375a
slim kernel
futz12 Apr 27, 2026
66c0b68
slim sdpa
futz12 Apr 28, 2026
1f1b8a7
simd vec
futz12 Apr 28, 2026
c7c7d1d
remove useless specialization
futz12 Apr 28, 2026
274f6d5
apply code-format changes
futz12 Apr 28, 2026
1473d63
split kv
futz12 Apr 29, 2026
cdd69dd
prefetch opt for decode
futz12 Apr 29, 2026
5a0a879
opt for gemv
futz12 Apr 29, 2026
ce8b73f
support perf int8
futz12 Apr 29, 2026
bd7ea9c
apply code-format changes
futz12 Apr 29, 2026
69d873b
basic int8
futz12 May 1, 2026
c654a40
support int8
futz12 May 3, 2026
61440a7
Merge branch 'sdpa-opt-flashattn-x86' of https://github.com/futz12/nc…
futz12 May 3, 2026
ccfb1c8
apply code-format changes
futz12 May 3, 2026
458f55c
support perf int8
futz12 May 3, 2026
586c35d
Merge branch 'sdpa-opt-flashattn-x86' of https://github.com/futz12/nc…
futz12 May 3, 2026
702a664
opt bf16s
futz12 May 4, 2026
aab48b5
perf: vectorize int8 decode/prefill softmax+mask, refactor sdpa_decod…
futz12 May 4, 2026
285dee7
perf: vectorize int8 prefill softmax, use decode_mask_vec consistentl…
futz12 May 4, 2026
2dfb566
perf: vectorize int8 prefill scale ops, fix pv_gemm register pressure…
futz12 May 4, 2026
397412d
perf: add K-row software prefetch in qk_gemm_specialized_tiled_avx512…
futz12 May 4, 2026
f0737a8
perf: add group_parallel path to BF16 decode for MQA/GQA thread utili…
futz12 May 4, 2026
b448d0d
refactor: remove 4 *_dispatch thin wrappers, call underlying function…
futz12 May 4, 2026
7677339
refactor: extract FP32/BF16 prefill logic from forward() into sdpa_fo…
futz12 May 4, 2026
9953352
refactor: remove duplicate BLOCK_N=128 in forward() decode path
futz12 May 4, 2026
3d065c2
refactor: sdpa_decode now calls sdpa_decode_chunk, eliminating ~45 li…
futz12 May 4, 2026
596884f
refactor: extract sdpa_int8_decode_core helper, eliminate ~60 lines o…
futz12 May 4, 2026
da81931
refactor(sdpa_x86): extract top-level decode/prefill path functions f…
futz12 May 4, 2026
d14dea8
perf(sdpa_x86): optimize MQA prefill for small seqlen
futz12 May 4, 2026
2b2965a
perf(sdpa_x86): fix MQA/GQA prefill regressions and optimize small se…
futz12 May 5, 2026
f39f525
apply code-format changes
futz12 May 5, 2026
02e8f84
Merge branch 'master' into sdpa-opt-flashattn-x86
futz12 Jun 2, 2026
155d7e4
x86: improve sdpa prefill gqa large-dim path
futz12 Jun 2, 2026
4b1d090
x86: avoid sdpa prefill bf16 query roundtrip
futz12 Jun 2, 2026
7818da3
x86: avoid unused sdpa prefill q batch workspace
futz12 Jun 2, 2026
c375f8f
x86: hoist sdpa avx512 qk tail mask
futz12 Jun 2, 2026
7d5cfae
x86: hoist sdpa avx512 pv prefetch check
futz12 Jun 2, 2026
3ebc916
x86: avoid sdpa prefill mha q packing
futz12 Jun 2, 2026
99ec4db
x86: avoid sdpa prefill softmax max copy
futz12 Jun 2, 2026
4cf724f
optimize x86 sdpa prefill kernels
futz12 Jun 3, 2026
32c0e6b
apply code-format changes
futz12 Jun 3, 2026
1 change: 1 addition & 0 deletions ggml
Submodule ggml added at 8be60f