feat(search): add ngram support for Japanese text analysis in Elastic…#2914

Open
hachimitsu-pichi wants to merge 5 commits into master from feat/ngram-elasticsearch

Conversation

@hachimitsu-pichi

close #2826

Contributor

Copilot AI left a comment

Pull request overview

Adds Japanese n-gram analysis support to the Elasticsearch message index to improve search behavior for Japanese text (close #2826).

Changes:

  • Adds a text.ngram multi-field to the ES mapping for message text.
  • Introduces an n-gram tokenizer plus index/search analyzers intended for Japanese partial matching.

Comment thread service/search/es.go
Comment on lines +110 to +116
"fields": m{
"ngram": m{
"type": "text",
"search_analyzer": "ja_ngram_search_analyzer",
"analyzer": "ja_ngram_index_analyzer",
},
},
Copilot AI Jan 21, 2026

These mapping/analysis changes are only applied when the index is first created (the code creates the index only when it does not exist). Existing deployments won’t pick up the new text.ngram field/analyzers without deleting/recreating or reindexing into a new index. Consider adding an explicit migration/reindex path (e.g., versioned index name + alias) or documenting the required operational steps.
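The "versioned index name + alias" path mentioned above could look roughly like the following. This is a minimal sketch, not the project's migration code: the alias name `message`, the `_vN` naming scheme, and the helper names are hypothetical, and `m` is assumed to be the same `map[string]any` shorthand the diff uses. It only builds the request bodies for the Elasticsearch `_reindex` and `_aliases` endpoints; sending them is left to whichever client the service already uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// m mirrors the map shorthand seen in the diff (assumed definition).
type m map[string]any

// versionedIndex returns a hypothetical versioned index name, e.g. "message_v2".
func versionedIndex(alias string, version int) string {
	return fmt.Sprintf("%s_v%d", alias, version)
}

// reindexBody builds the body for POST /_reindex, copying documents
// from the old index into the new one (created with the new mapping).
func reindexBody(src, dst string) m {
	return m{
		"source": m{"index": src},
		"dest":   m{"index": dst},
	}
}

// aliasSwapBody builds the body for POST /_aliases. Both actions are
// applied in one request, so readers of the alias never see a gap.
func aliasSwapBody(alias, oldIndex, newIndex string) m {
	return m{
		"actions": []m{
			{"remove": m{"index": oldIndex, "alias": alias}},
			{"add": m{"index": newIndex, "alias": alias}},
		},
	}
}

func main() {
	oldIdx := versionedIndex("message", 1)
	newIdx := versionedIndex("message", 2)
	bodies := []m{
		reindexBody(oldIdx, newIdx),
		aliasSwapBody("message", oldIdx, newIdx),
	}
	for _, body := range bodies {
		b, err := json.MarshalIndent(body, "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(b))
	}
}
```

With this layout the application would query the alias rather than a concrete index, so a mapping change becomes "create the next version, reindex, swap the alias".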

Copilot uses AI. Check for mistakes.
Comment thread service/search/es.go
Comment on lines +188 to +193
"ja_ngram_index_analyzer": m{
"type": "custom",
"char_filter": []string{
"normalize",
},
"tokenizer": "ja_ngram_tokenizer",
Copilot AI Jan 21, 2026

Both n-gram analyzers reference a char_filter named normalize, but there is no analysis.char_filter definition in these settings and normalize is not a standard built-in char filter. This will cause index creation to fail with an unknown char filter error. Either define the intended char filter under settings.index.analysis.char_filter or switch to a valid built-in/plugin-provided char filter name, or remove the char_filter entry if it isn't required.
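One way to resolve this is to define `normalize` explicitly next to the analyzers that use it. The sketch below assumes the intent is Unicode normalization through the `icu_normalizer` char filter, which is only available when the analysis-icu plugin is installed; that intent is a guess, not something the diff confirms. The `undefinedCharFilters` helper is hypothetical and only illustrates the check that Elasticsearch effectively performs at index creation.

```go
package main

import "fmt"

// m mirrors the map shorthand seen in the diff (assumed definition).
type m map[string]any

// analysisWithNormalize returns an analysis section in which the
// "normalize" char filter is defined alongside the analyzer that uses it.
// ASSUMPTION: "normalize" stands for ICU NFKC normalization (analysis-icu plugin).
func analysisWithNormalize() m {
	return m{
		"char_filter": m{
			"normalize": m{
				"type": "icu_normalizer",
				"name": "nfkc",
				"mode": "compose",
			},
		},
		"analyzer": m{
			"ja_ngram_index_analyzer": m{
				"type":        "custom",
				"char_filter": []string{"normalize"},
				"tokenizer":   "ja_ngram_tokenizer",
			},
		},
	}
}

// undefinedCharFilters lists char filter names that analyzers reference
// but that are missing from analysis.char_filter. Built-in names such as
// "html_strip" are not special-cased in this sketch.
func undefinedCharFilters(analysis m) []string {
	defined, _ := analysis["char_filter"].(m)
	analyzers, _ := analysis["analyzer"].(m)
	var missing []string
	for _, a := range analyzers {
		def, ok := a.(m)
		if !ok {
			continue
		}
		names, _ := def["char_filter"].([]string)
		for _, name := range names {
			if _, ok := defined[name]; !ok {
				missing = append(missing, name)
			}
		}
	}
	return missing
}

func main() {
	fmt.Println("undefined char filters:", undefinedCharFilters(analysisWithNormalize()))
}
```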

Comment thread service/search/es.go
Comment on lines +110 to +116
"fields": m{
"ngram": m{
"type": "text",
"search_analyzer": "ja_ngram_search_analyzer",
"analyzer": "ja_ngram_index_analyzer",
},
},
Copilot AI Jan 21, 2026

This adds the text.ngram multi-field, but the search query in this file currently searches only the text field (not text.ngram). As a result, the new n-gram analyzer won’t affect search behavior. Consider updating the query to include text.ngram (optionally with a lower boost than text) so Japanese partial matching actually uses the n-gram index.

Suggested change
"fields": m{
"ngram": m{
"type": "text",
"search_analyzer": "ja_ngram_search_analyzer",
"analyzer": "ja_ngram_index_analyzer",
},
},
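For the query side, a minimal sketch of what searching both fields might look like is shown here. The existing query in es.go is not visible in this thread, so the `textQuery` helper, the `most_fields` type, and the boost of 2 on `text` are illustrative assumptions; only the `multi_match` query and the `field^boost` syntax come from Elasticsearch itself.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// m mirrors the map shorthand seen in the diff (assumed definition).
type m map[string]any

// textQuery builds a multi_match query over the morphologically analyzed
// "text" field and the new "text.ngram" sub-field. "text" is boosted so
// exact token matches rank above partial n-gram matches (boost value is
// an assumption, to be tuned).
func textQuery(word string) m {
	return m{
		"multi_match": m{
			"query":  word,
			"type":   "most_fields",
			"fields": []string{"text^2", "text.ngram"},
		},
	}
}

func main() {
	b, err := json.MarshalIndent(textQuery("検索"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```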

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


Comment thread service/search/es.go
Comment on lines +175 to +179
},
},
"filter": m{
"sudachi_split_filter": m{
"type": "sudachi_split",
Copilot AI Jan 21, 2026

The analysis map is closed before filter and analyzer are defined, making filter/analyzer siblings of analysis under index. Elasticsearch requires char_filter/tokenizer/filter/analyzer to all be nested under index.analysis; with the current nesting the index settings JSON will be rejected.
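The expected nesting can be sketched as follows. Component bodies are abbreviated, `max_gram` is an assumption (the diff excerpt only shows `min_gram`), and `misplacedAnalysisKeys` is a hypothetical helper that flags the exact mistake described above: analysis components sitting directly under `index`.

```go
package main

import "fmt"

// m mirrors the map shorthand seen in the diff (assumed definition).
type m map[string]any

// indexSettings shows tokenizer, filter, and analyzer all nested under
// index.analysis. A char_filter map would sit at the same level.
func indexSettings() m {
	return m{
		"index": m{
			"analysis": m{
				"tokenizer": m{
					"ja_ngram_tokenizer": m{
						"type":     "ngram",
						"min_gram": 2,
						"max_gram": 3, // assumption; not visible in the diff
					},
				},
				"filter": m{
					"sudachi_split_filter": m{
						"type": "sudachi_split",
						"mode": "search",
					},
				},
				"analyzer": m{
					"ja_ngram_search_analyzer": m{
						"type":      "custom",
						"tokenizer": "ja_ngram_tokenizer",
						"filter":    []string{"lowercase"},
					},
				},
			},
		},
	}
}

// misplacedAnalysisKeys reports analysis components declared directly
// under "index" instead of under "index.analysis".
func misplacedAnalysisKeys(settings m) []string {
	index, _ := settings["index"].(m)
	var misplaced []string
	for _, key := range []string{"char_filter", "tokenizer", "filter", "analyzer"} {
		if _, ok := index[key]; ok {
			misplaced = append(misplaced, key)
		}
	}
	return misplaced
}

func main() {
	fmt.Println("misplaced keys:", misplacedAnalysisKeys(indexSettings()))
}
```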

Comment thread service/search/es.go Outdated
Comment on lines +209 to +213
},
"tokenizer": "ja_ngram_tokenizer",
"filter": []string{
"lowercase",
}},
Copilot AI Jan 21, 2026

The ja_ngram_search_analyzer literal ends with }}, on the same line as the filter slice closing brace, which is not gofmt-formatted. Since gofmt is enforced via golangci config (.golangci.yml:23-26), please run gofmt so the closing braces/commas are formatted correctly.

Comment thread service/search/es.go Outdated
Comment on lines +151 to +169
},
"filter": m{
"sudachi_split_filter": m{
"type": "sudachi_split",
"mode": "search",

"ja_ngram_tokenizer": m{
"type": "ngram",
"min_gram": 2,
Copilot AI Jan 21, 2026

ja_ngram_tokenizer is declared as a sibling of analysis.tokenizer (right after the tokenizer map is closed). Elasticsearch expects custom tokenizers to be defined inside the analysis.tokenizer object; otherwise analyzers referencing ja_ngram_tokenizer will fail when creating the index.
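To illustrate why the placement matters, here is a small sketch of the lookup an analyzer's `tokenizer` reference has to satisfy. The `unresolvedTokenizers` helper and the analyzer name `a` are hypothetical, and the built-in list is deliberately partial; the point is only that a tokenizer declared beside `analysis.tokenizer` rather than inside it is not found.

```go
package main

import "fmt"

// m mirrors the map shorthand seen in the diff (assumed definition).
type m map[string]any

// builtinTokenizers is a partial list of tokenizer names that can be
// referenced without a custom definition (not exhaustive).
var builtinTokenizers = map[string]bool{
	"standard":   true,
	"keyword":    true,
	"whitespace": true,
	"ngram":      true,
	"edge_ngram": true,
}

// unresolvedTokenizers maps analyzer name to tokenizer name for every
// analyzer whose tokenizer is neither defined under analysis.tokenizer
// nor in the partial built-in list above.
func unresolvedTokenizers(analysis m) map[string]string {
	defined, _ := analysis["tokenizer"].(m)
	analyzers, _ := analysis["analyzer"].(m)
	unresolved := map[string]string{}
	for name, a := range analyzers {
		def, ok := a.(m)
		if !ok {
			continue
		}
		tok, _ := def["tokenizer"].(string)
		if tok == "" {
			continue
		}
		if _, ok := defined[tok]; ok || builtinTokenizers[tok] {
			continue
		}
		unresolved[name] = tok
	}
	return unresolved
}

func main() {
	// Mirrors the reported mistake: the tokenizer is a sibling of
	// "tokenizer" instead of an entry inside it.
	misplaced := m{
		"tokenizer":          m{},
		"ja_ngram_tokenizer": m{"type": "ngram", "min_gram": 2},
		"analyzer": m{
			"a": m{"type": "custom", "tokenizer": "ja_ngram_tokenizer"},
		},
	}
	fmt.Println("unresolved:", unresolvedTokenizers(misplaced))
}
```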

@Eraxyso
Contributor

Eraxyso commented Jan 21, 2026

Apparently Elasticsearch needs to be reindexed.

Development

Successfully merging this pull request may close these issues.

Analyze with n-grams in addition to morphological analysis

3 participants