feat(search): add ngram support for Japanese text analysis in Elastic… #2914
hachimitsu-pichi wants to merge 5 commits into master from
Conversation
Pull request overview
Adds Japanese n-gram analysis support to the Elasticsearch message index to improve search behavior for Japanese text (closes #2826).
Changes:
- Adds a text.ngram multi-field to the ES mapping for message text.
- Introduces an n-gram tokenizer plus index/search analyzers intended for Japanese partial matching.
| "fields": m{ | ||
| "ngram": m{ | ||
| "type": "text", | ||
| "search_analyzer": "ja_ngram_search_analyzer", | ||
| "analyzer": "ja_ngram_index_analyzer", | ||
| }, | ||
| }, |
These mapping/analysis changes are only applied when the index is first created (the code creates the index only when it does not exist). Existing deployments won’t pick up the new text.ngram field/analyzers without deleting/recreating or reindexing into a new index. Consider adding an explicit migration/reindex path (e.g., versioned index name + alias) or documenting the required operational steps.
| "ja_ngram_index_analyzer": m{ | ||
| "type": "custom", | ||
| "char_filter": []string{ | ||
| "normalize", | ||
| }, | ||
| "tokenizer": "ja_ngram_tokenizer", |
Both n-gram analyzers reference a char_filter named normalize, but there is no analysis.char_filter definition in these settings and normalize is not a standard built-in char filter. This will cause index creation to fail with an unknown char filter error. Either define the intended char filter under settings.index.analysis.char_filter or switch to a valid built-in/plugin-provided char filter name, or remove the char_filter entry if it isn't required.
| "fields": m{ | ||
| "ngram": m{ | ||
| "type": "text", | ||
| "search_analyzer": "ja_ngram_search_analyzer", | ||
| "analyzer": "ja_ngram_index_analyzer", | ||
| }, | ||
| }, |
This adds the text.ngram multi-field, but the search query in this file currently searches only the text field (not text.ngram). As a result, the new n-gram analyzer won’t affect search behavior. Consider updating the query to include text.ngram (optionally with a lower boost than text) so Japanese partial matching actually uses the n-gram index.
| "fields": m{ | |
| "ngram": m{ | |
| "type": "text", | |
| "search_analyzer": "ja_ngram_search_analyzer", | |
| "analyzer": "ja_ngram_index_analyzer", | |
| }, | |
| }, |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
        },
    },
    "filter": m{
        "sudachi_split_filter": m{
            "type": "sudachi_split",
The analysis map is closed before filter and analyzer are defined, making filter/analyzer siblings of analysis under index. Elasticsearch requires char_filter/tokenizer/filter/analyzer to all be nested under index.analysis; with the current nesting the index settings JSON will be rejected.
        },
        "tokenizer": "ja_ngram_tokenizer",
        "filter": []string{
            "lowercase",
        }},
The ja_ngram_search_analyzer literal ends with }}, on the same line as the filter slice closing brace, which is not gofmt-formatted. Since gofmt is enforced via golangci config (.golangci.yml:23-26), please run gofmt so the closing braces/commas are formatted correctly.
    },
    "filter": m{
        "sudachi_split_filter": m{
            "type": "sudachi_split",
            "mode": "search",

    "ja_ngram_tokenizer": m{
        "type": "ngram",
        "min_gram": 2,
ja_ngram_tokenizer is declared as a sibling of analysis.tokenizer (right after the tokenizer map is closed). Elasticsearch expects custom tokenizers to be defined inside the analysis.tokenizer object; otherwise analyzers referencing ja_ngram_tokenizer will fail when creating the index.
Apparently Elasticsearch needs to be reindexed for this.
close #2826