Normalize char before pattern lookup (#4252)

There is an edge case in FuzzyMatchV1 during the backward scan, related to normalization: if the string is initially denormalized (e.g. it contains a Unicode symbol), the backward scan proceeds further to the next char; however, when the score is computed, the string is normalized first and then scanned against the pattern. The pattern index is then incremented past the end of the pattern, which leads to an out-of-bounds index access and a panic.

To illustrate the process, here is the sequence of operations when a search is performed:

1. during the backward scan by the "minim" pattern
```
xxxxx Minímal example
      ^^^^^^^^^^^^
      ||||||||||||
      miniiiiiiiim <- compute score for this substring
```

2. during score computation by the "minim" pattern
```
Minímal exam
minimal exam <- normalize chars before computing the score
^^^^^^
||||||
minim <- at this point the pattern is already fully scanned and the index is out of bounds
```

In this commit the char is normalized during the backward scan, so that the boundaries for the pattern are detected properly.
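The disagreement between a normalized and a non-normalized scan can be reproduced outside fzf with a small sketch. This is not the FuzzyMatchV1 code: `foldRune` is a hypothetical stand-in for `normalizeRune` that covers only a handful of accented letters, and `backwardStart` is an illustrative helper that walks the text and the pattern from their last rune to their first.

```go
package main

import "fmt"

// foldRune is a hypothetical stand-in for fzf's normalizeRune: it maps a few
// accented Latin letters to their ASCII base letter and leaves all other
// runes unchanged.
func foldRune(r rune) rune {
	switch r {
	case 'á', 'à', 'â', 'ä':
		return 'a'
	case 'é', 'è', 'ê', 'ë':
		return 'e'
	case 'í', 'ì', 'î', 'ï':
		return 'i'
	case 'ó', 'ò', 'ô', 'ö':
		return 'o'
	case 'ú', 'ù', 'û', 'ü':
		return 'u'
	}
	return r
}

// backwardStart walks text from its last rune to its first, consuming pattern
// from its last rune to its first. It returns the text index at which the
// first pattern rune was matched, or -1 if the pattern was not fully consumed.
func backwardStart(text, pattern []rune, normalize bool) int {
	pidx := len(pattern) - 1
	for idx := len(text) - 1; idx >= 0 && pidx >= 0; idx-- {
		char := text[idx]
		if normalize {
			char = foldRune(char)
		}
		if char == pattern[pidx] {
			if pidx == 0 {
				return idx
			}
			pidx--
		}
	}
	return -1
}

func main() {
	text := []rune("minímal exam")
	pattern := []rune("minim")
	fmt.Println(backwardStart(text, pattern, true))  // scan sees 'í' as 'i'
	fmt.Println(backwardStart(text, pattern, false)) // scan sees 'í' as a different rune
}
```

With normalization the scan consumes the whole pattern and reports index 0; without it, 'í' never matches 'i' and the pattern is not fully consumed (-1). Two passes over the same text that disagree in this way will compute different boundaries, which is the kind of mismatch the commit removes by normalizing in both places.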
2025-08-17 05:03:52 -07:00 · 2025-02-17 13:50:15 +02:00
parent 1eafc4e5d9
commit 01d9d9c8c8
2 changed files with 12 additions and 0 deletions
--- a/src/algo/algo.go
+++ b/src/algo/algo.go
@@ -767,6 +767,9 @@ func FuzzyMatchV1(caseSensitive bool, normalize bool, forward bool, text *util.C
 					char = unicode.To(unicode.LowerCase, char)
 				}
 			}
+			if normalize {
+				char = normalizeRune(char)
+			}

 			pidx_ := indexAt(pidx, lenPattern, forward)
 			pchar := pattern[pidx_]
--- a/src/algo/algo_test.go
+++ b/src/algo/algo_test.go
@@ -200,3 +200,12 @@ func TestLongString(t *testing.T) {
 	bytes[math.MaxUint16] = 'z'
 	assertMatch(t, FuzzyMatchV2, true, true, string(bytes), "zx", math.MaxUint16, math.MaxUint16+2, scoreMatch*2+bonusConsecutive)
 }
+
+func TestLongStringWithNormalize(t *testing.T) {
+	bytes := make([]byte, 30000)
+	for i := range bytes {
+		bytes[i] = 'x'
+	}
+	unicodeString := string(bytes) + " Minímal example"
+	assertMatch2(t, FuzzyMatchV1, false, true, false, unicodeString, "minim", 30001, 30006, 140)
+}