Math: FFT: Optimizations with Xtensa HiFi code#10637
Math: FFT: Optimizations with Xtensa HiFi code#10637singalsu wants to merge 2 commits intothesofproject:mainfrom
Conversation
6acb7ad to
0fef7d3
Compare
There was a problem hiding this comment.
Pull request overview
This PR refactors the multi-radix FFT implementation by moving the generic multi-FFT logic out of fft_multi.c and introducing a HiFi3-optimized implementation using Xtensa intrinsics, plus a small HiFi3 FFT kernel optimization.
Changes:
- Split
dft3_32()andfft_multi_execute_32()out offft_multi.cinto new generic and HiFi3-specific source files. - Add a new HiFi3-optimized multi-FFT implementation (
fft_multi_hifi3.c) using packed complex arithmetic and fused MAC operations. - Optimize the HiFi3 32-bit FFT kernel by special-casing the first stage and skipping the twiddle multiply for
j=0.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| test/cmocka/src/math/fft/CMakeLists.txt | Updates unit test build inputs to compile the new split multi-FFT sources. |
| src/math/fft/fft_multi_hifi3.c | Adds HiFi3-optimized dft3_32() and fft_multi_execute_32() implementations. |
| src/math/fft/fft_multi_generic.c | Adds the generic dft3_32() and fft_multi_execute_32() implementations previously in fft_multi.c. |
| src/math/fft/fft_multi.c | Removes dft3_32()/fft_multi_execute_32() from this file, leaving plan allocation/free + twiddle table inclusion. |
| src/math/fft/fft_32_hifi3.c | Optimizes HiFi3 FFT stage execution (skip twiddle multiply in first stage and for j=0). |
| src/math/fft/CMakeLists.txt | Adds the new multi-FFT source files to the build when CONFIG_MATH_FFT_MULTI is enabled. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
0fef7d3 to
9e70d76
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
| /* advance pointers past current group's bottom half */ | ||
| top_ptr += n; | ||
| bot_ptr += n; |
There was a problem hiding this comment.
just to check - is this really correct? top_ptr and bot_ptr are incremented n times inside the loop, and now n is added to them again?
There was a problem hiding this comment.
Yes it is correct, for the loop of k, the top is incremented by m. In the pointer arithmetic version the loop increments them one by one to n. But since n is m/2, need to increment other n.
9e70d76 to
21d771f
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/math/fft/fft_32_hifi3.c
Outdated
| sample1 = tw_r[0]; | ||
| sample2 = tw_i[0]; |
There was a problem hiding this comment.
Twiddle loading here assigns int32_t values directly into ae_int32x2 temporaries before packing with AE_SEL32_LH(). In other HiFi3 code paths (e.g. fft_multi_hifi3.c) scalar-to-vector uses AE_MOVDA32() to avoid relying on implicit conversions / lane contents. Please convert tw_r[0]/tw_i[0] using AE_MOVDA32() (or an equivalent intrinsic) before packing, so the real/imag lanes are well-defined and the complex multiply uses the intended coefficients.
| sample1 = tw_r[0]; | |
| sample2 = tw_i[0]; | |
| sample1 = AE_MOVDA32(tw_r[0]); | |
| sample2 = AE_MOVDA32(tw_i[0]); |
There was a problem hiding this comment.
I like the cleaner look of trivial operations like "=", adding those AE_MOVDA32() is same MCPS and probably same code, but true that it's defined in the manual to know what exactly happens without guessing.
There was a problem hiding this comment.
Using now AE_MOVDA32X2().
lgirdwood
left a comment
There was a problem hiding this comment.
Nice MCPS saving, good for me, but pls respond to copilot.
21d771f to
608b774
Compare
This patch optimizes the cycle count of the radix-2 Cooley-Tukey implementation with three changes: - Dedicated depth-1 stage: all N/2 butterflies use a real twiddle factor W^0 = 1+0j, so the complex multiply is replaced by plain add or subtract. - Skip multiply for j=0 in stages >= 2: The first butterfly in every group also uses W^0, saving an additional ~N/2 complex multiplications across all remaining stages. - Pointer arithmetic: replace per-butterfly index arithmetic (outx[k+j], outx[k+j+n], twiddle[i*j]) with auto-incrementing pointers and strided twiddle access (tw_r += stride), eliminating integer multiplies for address computation. This change saves 11 MCPS (from 74 MCPS to 63 MCPS) in STFT Process module in MTL platform with 1024/256 size/hop FFT processing. It was tested with scripts: scripts/rebuild-testbench.sh -p mtl scripts/sof-testbench-helper.sh -x -m stft_process_1024_256_ \ -p profile-stft_process.txt Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
This patch adds HiFi3 versions for functions dft3_32() and fft_multi_execute_32(). The functions are implemented to fft_multi_hifi3.c and the generic versions are moved to fft_multi_generic.c. in MTL platform the optimization saves 119 MCPS, from 237 MCPS to 118 MCPS. The test was done with script run: scripts/rebuild-testbench.sh -p mtl scripts/sof-testbench-helper.sh -x -m stft_process_1536_240_ \ -p profile-stft_process.txt The above STFT used FFT length of 1536 with hop 240. Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
608b774 to
2c43e7c
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /* Twiddle factor tables defined in fft_multi.c via twiddle_3072_32.h */ | ||
| #define FFT_MULTI_TWIDDLE_SIZE 2048 | ||
| extern const int32_t multi_twiddle_real_32[]; | ||
| extern const int32_t multi_twiddle_imag_32[]; | ||
|
|
||
| /* Constants for size 3 DFT */ | ||
| #define DFT3_COEFR -1073741824 /* int32(-0.5 * 2^31) */ | ||
| #define DFT3_COEFI 1859775393 /* int32(sqrt(3) / 2 * 2^31) */ | ||
| #define DFT3_SCALE 715827883 /* int32(1/3*2^31) */ |
There was a problem hiding this comment.
FFT_MULTI_TWIDDLE_SIZE is hard-coded here (2048) even though the authoritative value is already defined in twiddle_3072_32.h. Duplicating this constant in multiple translation units risks the size drifting out of sync with the actual twiddle tables, which would break indexing in the twiddle-multiply loop. Consider introducing a small shared header for the multi-twiddle tables (size macro + extern declarations) and include that instead of repeating the literal size in each implementation.
| /* Twiddle factor tables defined in fft_multi.c via twiddle_3072_32.h */ | |
| #define FFT_MULTI_TWIDDLE_SIZE 2048 | |
| extern const int32_t multi_twiddle_real_32[]; | |
| extern const int32_t multi_twiddle_imag_32[]; | |
| /* Constants for size 3 DFT */ | |
| #define DFT3_COEFR -1073741824 /* int32(-0.5 * 2^31) */ | |
| #define DFT3_COEFI 1859775393 /* int32(sqrt(3) / 2 * 2^31) */ | |
| #define DFT3_SCALE 715827883 /* int32(1/3*2^31) */ | |
| #include <sof/math/twiddle_3072_32.h> | |
| /* Constants for size 3 DFT */ | |
| #define DFT3_COEFR -1073741824 /* int32(-0.5 * 2^31) */ | |
| #define DFT3_COEFI 1859775393 /* int32(sqrt(3) / 2 * 2^31) */ | |
| #define DFT3_SCALE 715827883 /* int32(1/3*2^31) */ | |
| #define DFT3_COEFR -1073741824 /* int32(-0.5 * 2^31) */ | |
| #define DFT3_COEFI 1859775393 /* int32(sqrt(3) / 2 * 2^31) */ | |
| #define DFT3_SCALE 715827883 /* int32(1/3*2^31) */ |
There was a problem hiding this comment.
I have another PR under work to make twiddle factors cold data in #10457. It will address this. But as a larger change I don't want to do it now.
| #include <string.h> | ||
|
|
||
| /* Twiddle factor tables defined in fft_multi.c via twiddle_3072_32.h */ | ||
| #define FFT_MULTI_TWIDDLE_SIZE 2048 |
There was a problem hiding this comment.
FFT_MULTI_TWIDDLE_SIZE is duplicated here as a literal (2048) even though twiddle_3072_32.h defines the table size. Having multiple independent definitions increases the chance of size/table mismatches if the twiddle header is regenerated or changed. Consider moving the size macro + extern declarations into a shared header used by both generic and HiFi3 implementations, so only one place defines the twiddle size contract.
| #define FFT_MULTI_TWIDDLE_SIZE 2048 |
| void fft_multi_execute_32(struct fft_multi_plan *plan, bool ifft) | ||
| { | ||
| struct icomplex32 x[FFT_MULTI_COUNT_MAX]; | ||
| struct icomplex32 y[FFT_MULTI_COUNT_MAX]; |
There was a problem hiding this comment.
The new HiFi3-optimized dft3_32() / fft_multi_execute_32() implementations are compiled only when FFT_HIFI3 is defined, but the cmocka FFT tests run in a generic host build where this code is compiled out. This means the new optimized path likely isn't exercised in CI, increasing regression risk (especially around fixed-point scaling/saturation). Consider adding a CI job that builds/runs these tests for an Xtensa/HiFi3 target (or an equivalent emulator/simulator setup) so both generic and HiFi3 paths are validated against the reference vectors.
| struct icomplex32 y[FFT_MULTI_COUNT_MAX]; |
There was a problem hiding this comment.
We need to get it done with ztest. Cmocka isn't performing xt-clang builds and xt-run runs. It's not an issue of this PR but a general gap.
No description provided.