Skip to content

Math: FFT: Optimizations with Xtensa HiFi code#10637

Open
singalsu wants to merge 2 commits intothesofproject:mainfrom
singalsu:math_fft_xtensa_hifi_optimizations
Open

Math: FFT: Optimizations with Xtensa HiFi code#10637
singalsu wants to merge 2 commits intothesofproject:mainfrom
singalsu:math_fft_xtensa_hifi_optimizations

Conversation

@singalsu
Copy link
Copy Markdown
Collaborator

No description provided.

@singalsu singalsu force-pushed the math_fft_xtensa_hifi_optimizations branch from 6acb7ad to 0fef7d3 Compare March 20, 2026 16:00
@singalsu singalsu marked this pull request as ready for review March 20, 2026 16:29
Copilot AI review requested due to automatic review settings March 20, 2026 16:29
@singalsu singalsu requested a review from kv2019i as a code owner March 20, 2026 16:29
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the multi-radix FFT implementation by moving the generic multi-FFT logic out of fft_multi.c and introducing a HiFi3-optimized implementation using Xtensa intrinsics, plus a small HiFi3 FFT kernel optimization.

Changes:

  • Split dft3_32() and fft_multi_execute_32() out of fft_multi.c into new generic and HiFi3-specific source files.
  • Add a new HiFi3-optimized multi-FFT implementation (fft_multi_hifi3.c) using packed complex arithmetic and fused MAC operations.
  • Optimize the HiFi3 32-bit FFT kernel by special-casing the first stage and skipping the twiddle multiply for j=0.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
test/cmocka/src/math/fft/CMakeLists.txt Updates unit test build inputs to compile the new split multi-FFT sources.
src/math/fft/fft_multi_hifi3.c Adds HiFi3-optimized dft3_32() and fft_multi_execute_32() implementations.
src/math/fft/fft_multi_generic.c Adds the generic dft3_32() and fft_multi_execute_32() implementations previously in fft_multi.c.
src/math/fft/fft_multi.c Removes dft3_32()/fft_multi_execute_32() from this file, leaving plan allocation/free + twiddle table inclusion.
src/math/fft/fft_32_hifi3.c Optimizes HiFi3 FFT stage execution (skip twiddle multiply in first stage and for j=0).
src/math/fft/CMakeLists.txt Adds the new multi-FFT source files to the build when CONFIG_MATH_FFT_MULTI is enabled.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@singalsu singalsu force-pushed the math_fft_xtensa_hifi_optimizations branch from 0fef7d3 to 9e70d76 Compare March 20, 2026 16:51
@singalsu singalsu requested a review from Copilot March 20, 2026 17:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}
/* advance pointers past current group's bottom half */
top_ptr += n;
bot_ptr += n;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to check - is this really correct? top_ptr and bot_ptr are incremented n times inside the loop, and now n is added to them again?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is correct, for the loop of k, the top is incremented by m. In the pointer arithmetic version the loop increments them one by one to n. But since n is m/2, need to increment other n.

@singalsu singalsu force-pushed the math_fft_xtensa_hifi_optimizations branch from 9e70d76 to 21d771f Compare March 24, 2026 15:22
@singalsu singalsu requested review from Copilot and lyakh March 26, 2026 17:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +106 to +107
sample1 = tw_r[0];
sample2 = tw_i[0];
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Twiddle loading here assigns int32_t values directly into ae_int32x2 temporaries before packing with AE_SEL32_LH(). In other HiFi3 code paths (e.g. fft_multi_hifi3.c) scalar-to-vector uses AE_MOVDA32() to avoid relying on implicit conversions / lane contents. Please convert tw_r[0]/tw_i[0] using AE_MOVDA32() (or an equivalent intrinsic) before packing, so the real/imag lanes are well-defined and the complex multiply uses the intended coefficients.

Suggested change
sample1 = tw_r[0];
sample2 = tw_i[0];
sample1 = AE_MOVDA32(tw_r[0]);
sample2 = AE_MOVDA32(tw_i[0]);

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the cleaner look of trivial operations like "=", adding those AE_MOVDA32() is same MCPS and probably same code, but true that it's defined in the manual to know what exactly happens without guessing.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using now AE_MOVDA32X2().

Copy link
Copy Markdown
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice MCPS saving, good for me, but pls respond to copilot.

@singalsu singalsu force-pushed the math_fft_xtensa_hifi_optimizations branch from 21d771f to 608b774 Compare March 27, 2026 17:14
This patch optimizes the cycle count of the radix-2 Cooley-Tukey
implementation with three changes:

- Dedicated depth-1 stage: all N/2 butterflies use a real twiddle
  factor W^0 = 1+0j, so the complex multiply is replaced by plain
  add or subtract.

- Skip multiply for j=0 in stages >= 2: The first butterfly in every
  group also uses W^0, saving an additional ~N/2 complex multiplications
  across all remaining stages.

- Pointer arithmetic: replace per-butterfly index arithmetic
  (outx[k+j], outx[k+j+n], twiddle[i*j]) with auto-incrementing
  pointers and strided twiddle access (tw_r += stride), eliminating
  integer multiplies for address computation.

This change saves 11 MCPS (from 74 MCPS to 63 MCPS) in STFT Process
module in MTL platform with 1024/256 size/hop FFT processing. It was
tested with scripts:

scripts/rebuild-testbench.sh -p mtl
scripts/sof-testbench-helper.sh -x -m stft_process_1024_256_ \
  -p profile-stft_process.txt

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
This patch adds HiFi3 versions for functions dft3_32() and
fft_multi_execute_32(). The functions are implemented to
fft_multi_hifi3.c and the generic versions are moved to
fft_multi_generic.c.

in MTL platform the optimization saves 119 MCPS, from 237 MCPS
to 118 MCPS. The test was done with script run:

scripts/rebuild-testbench.sh -p mtl
scripts/sof-testbench-helper.sh -x -m stft_process_1536_240_ \
  -p profile-stft_process.txt

The above STFT used FFT length of 1536 with hop 240.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
@singalsu singalsu force-pushed the math_fft_xtensa_hifi_optimizations branch from 608b774 to 2c43e7c Compare March 27, 2026 17:45
@singalsu singalsu requested a review from Copilot March 27, 2026 17:47
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +17 to +25
/* Twiddle factor tables defined in fft_multi.c via twiddle_3072_32.h */
#define FFT_MULTI_TWIDDLE_SIZE 2048
extern const int32_t multi_twiddle_real_32[];
extern const int32_t multi_twiddle_imag_32[];

/* Constants for size 3 DFT */
#define DFT3_COEFR -1073741824 /* int32(-0.5 * 2^31) */
#define DFT3_COEFI 1859775393 /* int32(sqrt(3) / 2 * 2^31) */
#define DFT3_SCALE 715827883 /* int32(1/3*2^31) */
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FFT_MULTI_TWIDDLE_SIZE is hard-coded here (2048) even though the authoritative value is already defined in twiddle_3072_32.h. Duplicating this constant in multiple translation units risks the size drifting out of sync with the actual twiddle tables, which would break indexing in the twiddle-multiply loop. Consider introducing a small shared header for the multi-twiddle tables (size macro + extern declarations) and include that instead of repeating the literal size in each implementation.

Suggested change
/* Twiddle factor tables defined in fft_multi.c via twiddle_3072_32.h */
#define FFT_MULTI_TWIDDLE_SIZE 2048
extern const int32_t multi_twiddle_real_32[];
extern const int32_t multi_twiddle_imag_32[];
/* Constants for size 3 DFT */
#define DFT3_COEFR -1073741824 /* int32(-0.5 * 2^31) */
#define DFT3_COEFI 1859775393 /* int32(sqrt(3) / 2 * 2^31) */
#define DFT3_SCALE 715827883 /* int32(1/3*2^31) */
#include <sof/math/twiddle_3072_32.h>
/* Constants for size 3 DFT */
#define DFT3_COEFR -1073741824 /* int32(-0.5 * 2^31) */
#define DFT3_COEFI 1859775393 /* int32(sqrt(3) / 2 * 2^31) */
#define DFT3_SCALE 715827883 /* int32(1/3*2^31) */
#define DFT3_COEFR -1073741824 /* int32(-0.5 * 2^31) */
#define DFT3_COEFI 1859775393 /* int32(sqrt(3) / 2 * 2^31) */
#define DFT3_SCALE 715827883 /* int32(1/3*2^31) */

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have another PR under work to make twiddle factors cold data in #10457. It will address this. But as a larger change I don't want to do it now.

#include <string.h>

/* Twiddle factor tables defined in fft_multi.c via twiddle_3072_32.h */
#define FFT_MULTI_TWIDDLE_SIZE 2048
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FFT_MULTI_TWIDDLE_SIZE is duplicated here as a literal (2048) even though twiddle_3072_32.h defines the table size. Having multiple independent definitions increases the chance of size/table mismatches if the twiddle header is regenerated or changed. Consider moving the size macro + extern declarations into a shared header used by both generic and HiFi3 implementations, so only one place defines the twiddle size contract.

Suggested change
#define FFT_MULTI_TWIDDLE_SIZE 2048

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

void fft_multi_execute_32(struct fft_multi_plan *plan, bool ifft)
{
struct icomplex32 x[FFT_MULTI_COUNT_MAX];
struct icomplex32 y[FFT_MULTI_COUNT_MAX];
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new HiFi3-optimized dft3_32() / fft_multi_execute_32() implementations are compiled only when FFT_HIFI3 is defined, but the cmocka FFT tests run in a generic host build where this code is compiled out. This means the new optimized path likely isn't exercised in CI, increasing regression risk (especially around fixed-point scaling/saturation). Consider adding a CI job that builds/runs these tests for an Xtensa/HiFi3 target (or an equivalent emulator/simulator setup) so both generic and HiFi3 paths are validated against the reference vectors.

Suggested change
struct icomplex32 y[FFT_MULTI_COUNT_MAX];

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to get it done with ztest. Cmocka isn't performing xt-clang builds and xt-run runs. It's not an issue of this PR but a general gap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants