ripgrep

mirror of https://github.com/BurntSushi/ripgrep.git synced 2025-08-17 05:03:50 -07:00

Author	SHA1	Message	Date
Andrew Gallant	9d738ad0c0	regex: fix inner literal extraction that resulted in false negatives In some rare cases, it was possible for ripgrep's inner literal detector to extract a set of literals that could produce a false negative. #2884 gives an example: `(?i:e.x|ex)`. In this case, the set extracted can be discovered by running `rg '(?i:e.x|ex)' --trace`: Seq[E("EX"), E("Ex"), E("eX"), E("ex")] This extraction leads to building a multi-substring matcher for `EX`, `Ex`, `eX` and `ex`. Searching the haystack `e-x` produces no match, and thus, ripgrep shows no matches. But the regex `(?i:e.x|ex)` matches `e-x`. The issue at play here was that when two extracted literal sequences were unioned, we were incorrectly unioning their "prefix" attribute. And this in turn leads to those literal sequences being combined incorrectly via cross product. This case in particular triggers it because two different optimizations combine to produce an incorrect result. Firstly, the regex has a common prefix extracted and is rewritten as `(?i:e(?:.x|x))`. Secondly, the `x` in the first branch of the alternation has its `prefix` attribute set to `false` (correctly), which means it can't be cross producted with another concatenation. But in this case, it is unioned with the `x` from the second branch, and this results in the union result having `prefix` set to `true`. This in turn pops up and lets it get cross producted with the `e` prefix, producing an incorrect literal sequence. We fix this by changing the implementation of `union` to return `prefix` set to `true` only when both literal sequences being unioned have `prefix` set to `true`. Doing this exposed a second bug that was present, but was purely cosmetic: the extracted literals in this case, after the fix, are `X` and `x`. They were considered "exact" (i.e., lead to a match), but of course they are not. Observing an `X` or an `x` does not mean there is a match.
This was fixed by making `choose` always return an inexact literal sequence. This is perhaps too conservative in aggregate in some cases, but always correct. The idea here is that if one is choosing between two concatenations, then it is likely the case that the sequence returned should be considered inexact. The issue is that this can lead to avoiding cross products in some cases that would otherwise be correct. This is bad because it means extracting shorter literals in some cases. (In general, the longer the literal the better.) But we prioritize correctness for now and fix it. You can see a few tests where this shortens some extracted literals. Fixes #2884	2024-09-08 22:00:46 -04:00
tgolang	22b677900f	doc: fix some typos PR #2754	2024-05-13 07:44:51 -04:00
Andrew Gallant	c21302b409	regex: tweak inner literal heuristic Previously, we had logic to skip our own inner literal optimization if the regex itself was already (likely) accelerated. It turns out that the presence of a Unicode word boundary can defeat acceleration to a point. It's likely enough that even if the underlying regex is accelerated, it would be prudent to do our own inner literal optimization if the pattern has a Unicode word boundary. Normally a Unicode word boundary doesn't defeat literal optimizations, since even the slower engines can make use of prefix literal optimizations. But a regex can be accelerated via its own inner or suffix literal optimizations, and those require the use of a DFA (or lazy DFA). Since DFAs crap out on haystacks that contain a non-ASCII Unicode scalar value when the regex contains a Unicode word boundary, it follows that an "accelerated" regex can still wind up being quite slow. (An "accelerated" regex can also slow down because of restrictions on avoiding quadratic behavior, but I believe this happens less frequently and is not as severe as the slowdown that results from Unicode word boundaries. Namely, avoiding quadratic behavior just means giving up on the inner literal optimization for a single search. In which case, the regex engine can still fall back to a normal forward DFA. That will definitely be slower than an inner literal optimization done by ripgrep, but not quite as dramatic as it would be when DFAs can't be used at all.)	2023-11-20 23:51:53 -05:00
Ludi Rehak	7c83b90f95	doc: fix typo Closes #2153	2023-07-08 18:52:42 -04:00
Andrew Gallant	ca740d9ace	regex: add new inner literal extractor This is mostly a copy of the prefix literal extractor in regex-syntax, but with a tweaked notion of Seq that keeps track of whether it's a prefix of an expression or not. If it isn't, then we can't cross it as a suffix to another Seq. This new extractor should be a lot more robust than the old one. We actually will keep going through the regex to try and find the "best" literals to search for (according to some heuristic).	2023-07-05 14:04:29 -04:00
Andrew Gallant	8ac66a9e04	regex: refactor matcher construction This does a little bit of refactoring so that we can pass both a ConfiguredHIR and a Regex to the inner literal extraction routine. One downside of this approach is that a regex object hangs on to a ConfiguredHIR. But the extra memory usage is probably negligible. A benefit though is that converting the HIR to its concrete syntax is now lazy and only happens when logging is enabled.	2023-07-05 14:04:29 -04:00
Andrew Gallant	e028ea3792	regex: migrate grep-regex to regex-automata We just do a "basic" dumb migration. We don't try to improve anything here.	2023-07-05 14:04:29 -04:00
Andrew Gallant	1035f6b1ff	deps: initial migration steps to regex 1.9 This leaves the grep-regex crate in tatters. Pretty much the entire thing needs to be re-worked. The upshot is that it should result in some big simplifications. I hope. The idea here is to drop down and actually use regex-automata 0.3 instead of the regex crate itself.	2023-07-05 14:04:29 -04:00
Andrew Gallant	e824531e38	edition: manual changes This is mostly just about removing 'extern crate' everywhere and fixing the fallout.	2021-06-01 21:07:37 -04:00
Andrew Gallant	459a9c5637	edition: initial 'cargo fix --edition' run	2021-06-01 21:07:37 -04:00
Martin Michlmayr	1b2c1dc675	doc: fix typos PR #1605	2020-06-04 09:06:09 -04:00
Andrew Gallant	1c4b5adb7b	regex: fix another inner literal bug It looks like `is_simple` wasn't quite correct. I can't wait until this code is rewritten. It is still not quite clearly correct to me. Fixes #1537	2020-04-01 20:37:48 -04:00
Andrew Gallant	0ea65efd6d	regex: special case literal extraction In a prior commit, we fixed a performance problem with the -w flag by doing a little extra work to extract literals. It turns out that using literals in this case when the -w flag is NOT used results in a performance regression. The reasoning is that we end up using a "fast" regex as a prefilter when the regex engine itself uses its own equivalent prefilter, so ripgrep ends up redoing a fair amount of work. Instead, we only do this extra work when we know the -w flag is enabled.	2020-03-22 21:02:51 -04:00
Andrew Gallant	e772a95b58	regex: avoid using literal optimizations when whitespace is detected If a literal is entirely whitespace, then it's quite likely that it is very common. So when that case occurs, just don't do (inner) literal optimizations at all. The regex engine may still make sub-optimal decisions here, but that's a problem for another day. Fixes #1087	2020-03-15 13:19:14 -04:00
Andrew Gallant	9dd4bf8d7f	style: fix rust-analyzer lint warnings	2020-03-15 13:19:14 -04:00
Andrew Gallant	fdd8510fdd	repo: move all source code in crates directory The top-level listing was just getting a bit too long for my taste. So put all of the code in one directory and shrink the large top-level mess to a small top-level mess. NOTE: This commit only contains renames. The subsequent commit will actually make ripgrep build again. We do it this way with the naive hope that this will make it easier for git history to track the renames. Sigh.	2020-02-17 19:24:53 -05:00
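
The `prefix` rule described in commit 9d738ad0c0 can be illustrated with a small Rust sketch. It is not the code in grep-regex: the `Seq` type below is a hypothetical, simplified stand-in, and its `new`, `union` and `cross` methods are assumptions made for illustration. It only shows the rule the commit message states: a union keeps `prefix` set to `true` only when both inputs have it set, and a sequence without `prefix` cannot be crossed as a suffix onto another sequence.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified stand-in for an extracted literal sequence.
/// The real type in grep-regex is different; this only models the `prefix` flag.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Seq {
    literals: BTreeSet<String>,
    /// True only if every literal starts at the beginning of the
    /// sub-expression it was extracted from.
    prefix: bool,
}

impl Seq {
    fn new(lits: &[&str], prefix: bool) -> Seq {
        Seq { literals: lits.iter().map(|s| s.to_string()).collect(), prefix }
    }

    /// Union of two alternation branches. The result is a prefix only
    /// when BOTH inputs are prefixes (the rule stated in the commit).
    fn union(&self, other: &Seq) -> Seq {
        Seq {
            literals: self.literals.union(&other.literals).cloned().collect(),
            prefix: self.prefix && other.prefix,
        }
    }

    /// Cross product with `suffix` appended to `self`. Refused (None)
    /// when `suffix` is not a prefix of its own sub-expression, since
    /// other characters may sit between the two literals.
    fn cross(&self, suffix: &Seq) -> Option<Seq> {
        if !suffix.prefix {
            return None;
        }
        let mut literals = BTreeSet::new();
        for a in &self.literals {
            for b in &suffix.literals {
                literals.insert(format!("{}{}", a, b));
            }
        }
        Some(Seq { literals, prefix: self.prefix })
    }
}
```

With the rewritten pattern `(?i:e(?:.x|x))`, the `x` from the `.x` branch would be modelled as `Seq::new(&["X", "x"], false)` and the `x` from the second branch as `Seq::new(&["X", "x"], true)`. Their union has `prefix == false`, so crossing it onto the `e` literals returns `None` and literals such as `ex` are never built.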

16 Commits