Compare commits

...

46 Commits

Author SHA1 Message Date
Andrew Gallant
f4d07b9cbd grep-cli-0.1.8 2023-07-05 17:09:09 -04:00
Andrew Gallant
0b6eccf4d3 ci: try to fix CI 2023-07-05 14:04:29 -04:00
Andrew Gallant
3ac4541e9f regex: remove old inner literal extractor
(It had already been removed from the crate.)
2023-07-05 14:04:29 -04:00
Andrew Gallant
7b72e982f2 deps: update everything 2023-07-05 14:04:29 -04:00
Andrew Gallant
a68db3ac02 deps: drop temporary patch and move to bstr 1.6
Now that regex 1.9 is out, we can depend on it from crates.io.
2023-07-05 14:04:29 -04:00
Andrew Gallant
b12905daca deps: update everything 2023-07-05 14:04:29 -04:00
Andrew Gallant
ca740d9ace regex: add new inner literal extractor
This is mostly a copy of the prefix literal extractor in regex-syntax,
but with a tweaked notion of Seq that keeps track of whether it's a
prefix of an expression or not. If it isn't, then we can't cross it as a
suffix to another Seq.

This new extractor should be a lot more robust than the old one. We
actually will keep going through the regex to try and find the "best"
literals to search for (according to some heuristic).
2023-07-05 14:04:29 -04:00
Andrew Gallant
e80c102dee regex: tweak formatting of regex-automata version spec
This makes it easier to enable the `logging` feature for regex-automata.

I wish I could just enable it unconditionally, but it winds up producing
a lot of output because ripgrep uses regexes for things other than the
primary search (like every glob). Sigh.
2023-07-05 14:04:29 -04:00
Andrew Gallant
8ac66a9e04 regex: refactor matcher construction
This does a little bit of refactoring so that we can pass both a
ConfiguredHIR and a Regex to the inner literal extraction routine.

One downside of this approach is that a regex object hangs on to a
ConfiguredHIR. But the extra memory usage is probably negligible. A
benefit though is that converting the HIR to its concrete syntax is now
lazy and only happens when logging is enabled.
2023-07-05 14:04:29 -04:00
Andrew Gallant
04dde9a4eb regex: tweak DFA settings
This increases the limits a bit for when the regex engine will build and
use a fully compiled DFA. They can faster in some circumstances. For
example, '(?-u)^\w{30,}$' gets a nice speed boost from state
acceleration.

We are also able to remove `regex` proper as a dependency. Wow.
2023-07-05 14:04:29 -04:00
Andrew Gallant
81341702af regex: push more pattern handling to matcher construction
Previously, ripgrep core was responsible for escaping regex patterns and
implementing the --line-regexp flag. This commit moves that
responsibility down into the matchers such that ripgrep just needs to
hand the patterns it gets off to the matcher builder. The builder will
then take care of escaping and all that.

This was done to make pattern construction completely owned by the
matcher builders. With the arrival regex-automata, this means we can
move to the HIR very quickly and then never move back to the concrete
syntax. We can then build our regex directly from the HIR. This overall
can save quite a bit of time, especially when searching for large
dictionaries.

We still aren't quite as fast as GNU grep when searching something on
the scale of /usr/share/dict/words, but we are basically within spitting
distance. Prior to this, we were about an order of magnitude slower.

This architecture in particular lets us write a pretty simple fast path
that avoids AST parsing and HIR translation entirely: the case where one
is just searching for a literal. In that case, we can hand construct the
HIR directly.
2023-07-05 14:04:29 -04:00
Andrew Gallant
d34c5c88a7 globset: fix build error in tests
I guess we haven't been testing with the Serde feature enabled? Weird.
2023-07-05 14:04:29 -04:00
Andrew Gallant
4b8aa91ae5 deps: update to pcre2 0.2.4
0.2.4 updates to PCRE2 10.42 and has a few other nice changes. For
example, when `utf` is enabled, the crate will always set the
PCRE2_MATCH_INVALID_UTF option. That means we no longer need to do
transcoding or UTF-8 validity checks.

Because of this, we actually get to remove one of the two uses of
`unsafe` in ripgrep's `main` program.

(This also updates a couple other dependencies for convenience.)
2023-07-05 14:04:29 -04:00
Andrew Gallant
a775b493fd regex: small cleanups
Just some small polishing. We also get rid of thread_local in favor of
using regex-automata, mostly just in the name of reducing dependencies.
(We should eventually be able to drop thread_local completely.)
2023-07-05 14:04:29 -04:00
Andrew Gallant
a6dbff502f regex: s/locations/captures
Now that we use regex-automata, we no longer use any type with
"locations" in it. Instead, that's mostly legacy from the top-level
regex crate.
2023-07-05 14:04:29 -04:00
Andrew Gallant
51480d57a6 regex: simplify AST analysis a bit
The verbatim literal stuff hasn't been used for a while and I don't
foresee it being used. If it's really needed, it would probably better
to just implement it by looking at the pattern string itself, which
avoids parsing it into an AST altogether.
2023-07-05 14:04:29 -04:00
Andrew Gallant
d9bd261be8 regex: some small cleanup in 'strip.rs'
We also utilize bstr's methods to get rid of some helpers we had written
by hand.
2023-07-05 14:04:29 -04:00
Andrew Gallant
9d62eb997a BREAKING: regex: finally remove CRLF hack
Now that Rust's regex crate finally supports a CRLF mode, we can remove
this giant hack in ripgrep to enable it. (And assuredly did not work in
all cases.)

The way this works in the regex engine is actually subtly different than
what ripgrep previously did. Namely, --crlf would previously treat
either \r\n or \n as a line terminator. But now it treats \r\n, \n and
\r as line terminators. In effect, it is implemented by treating \r and
\n as line terminators, but ^ and $ will never match at a position
between a \r and a \n.

So basically this means that $ will end up matching in more cases than
it might be intended too, but I don't expect this to be a big problem in
practice.

Note that passing --crlf to ripgrep and enabling CRLF mode in the regex
via the `R` inline flag (e.g., `(?R:$)`) are subtly different. The `R`
flag just controls the regex engine, but --crlf instructs all of ripgrep
to use \r\n as a line terminator. There are likely some inconsistencies
or corner cases that are wrong as a result of this cognitive dissonance,
but we choose to leave well enough alone for now.

Fixing this for real will probably require re-thinking how line
terminators are handled in ripgrep. For example, one "problem" with how
they're handled now is that ripgrep will re-insert its own line
terminators when printing output instead of copying the input. This is
maybe not so great and perhaps unexpected. (ripgrep probably can't get
away with not inserting any line terminators. Users probably expect
files that don't end with a line terminator whose last line matches to
have a line terminator inserted.)
2023-07-05 14:04:29 -04:00
Andrew Gallant
e028ea3792 regex: migrate grep-regex to regex-automata
We just do a "basic" dumb migration. We don't try to improve anything
here.
2023-07-05 14:04:29 -04:00
Andrew Gallant
1035f6b1ff deps: initial migration steps to regex 1.9
This leaves the grep-regex crate in tatters. Pretty much the entire
thing needs to be re-worked. The upshot is that it should result in some
big simplifications. I hope.

The idea here is to drop down and actually use regex-automata 0.3
instead of the regex crate itself.
2023-07-05 14:04:29 -04:00
Andrew Gallant
a7f1276021 readme: update Debian instructions
We probably don't need to mention Buster specifically nor Debian
unstable since ripgrep has been in Debian for a while now.

But we can't just get rid of the `deb` file either, because Debian might
package a very old version.

Fixes #2531
2023-06-12 07:50:13 -04:00
Martin Nordholts
4fcb1b2202 cli: replace atty with std::io::IsTerminal
The `atty` crate is unmaintained[1] and `std::io::IsTerminal` was
stabilized in Rust 1.70.

[1]: https://rustsec.org/advisories/RUSTSEC-2021-0145.html

PR #2526
2023-06-05 14:00:46 -04:00
Francois Marier
949092fd22 ignore/types: add 'mdwn' to Markdown
PR #2520
2023-05-26 14:44:41 -04:00
Andrew Gallant
4a7e7094ad deps: update everything else 2023-05-25 13:06:13 -04:00
Andrew Gallant
fc0d9b90a9 deps: bump regex to 1.8.3
This brings in an update from the regex crate that fixes a matching bug
for particular kinds of alternations of literals.

Fixes #2518
2023-05-25 13:06:13 -04:00
Ville Skyttä
335aa4937a ignore/types: add *.pyi for Python
https://peps.python.org/pep-0484/#stub-files

PR #2517
2023-05-23 07:10:02 -04:00
Adam Reichold
803c447845 searcher: re-enable mmap on 32-bit architectures
memmap2 v0.3.0 introduced a regression when trying to map files larger than 4GB
on 32-bit architectures[1] which was subsequently fixed in v0.3.1[2].

This commit bumps locked version of the memmap2 dependency to the current v0.5.0
and reverts fdfc418be5 to re-enable mmap on 32-bit
architectures as a different approach to fixing [3].

This was tested to report matches from the end of a 5GB file using MinGW and Wine.

Ref #1911, PR #2000 

[1] 5e271224c8
[2] 9aa838aed9
[3] https://github.com/BurntSushi/ripgrep/issues/1911
2023-05-19 08:23:53 -04:00
Andrew Gallant
c5415adbe8 deps: update everything
This does unfortunately bring in both regex-syntax 0.6 and 0.7, but
we'll fix that once regex 1.9 is out.
2023-05-16 13:14:23 -04:00
Andrew Gallant
251376597f deps: update minimum version of grep crate
Ref #2516
2023-05-16 13:13:34 -04:00
Andrew Gallant
e593f5b7ee grep-0.2.12 2023-05-16 13:12:45 -04:00
Andrew Gallant
6b19be2477 crates/grep: remove 'deny(missing_docs)'
This crate is only a shim over a bunch of other crates. I'm not sure
that there's anything to add to each of the `pub extern` items. So
instead of just writing fluff, I removed the lint.

Fixes #2516
2023-05-16 13:10:42 -04:00
Ryan Whitehouse
041544853c doc: fix --quiet docs
The wording was previously inverted, which had the opposite
meaning as was intended.

Fixes #1962
2023-03-28 07:22:59 -04:00
Manu
a7ae9e4043 ignore/types: add support for docker-compose files
Default file is docker-compose.yml and the documentation
mentions overrides in the form of docker-compose.*.yml.

PR #2469
2023-03-21 12:56:38 -04:00
Andrew Gallant
595e7845b8 readme: add a link to delta's support for ripgrep
Ref: https://github.com/BurntSushi/ripgrep/issues/86#issuecomment-1469717706
2023-03-15 08:02:04 -04:00
David Ringo
44fb9fce2c ignore/types: add *.sln for msbuild
.sln is the extension for Visual Studio Project Soltion files, one of
the file types accepted as inputs by MSBuild.

PR #2415
2023-02-09 21:20:49 -05:00
Vincent Bockaert
339c46a6ed ignore/types: enhance terraform default filter
The default filter for terraform only checks for *.tf files, but there
are quite few other terraform filetypes.

The explanation for all of them can be found below (including link to
documentation from Hashicorp at time of writing)

- *.tf.json & *.tfvars.json is to capture the files written in
  JSON-based variant of the Terraform language
    - https://developer.hashicorp.com/terraform/language/files
- *.tfvars is used to supply variables
    - https://developer.hashicorp.com/terraform/cloud-docs/workspaces/variables#6-auto-tfvars-variable-files
- .terraform.lock.hcl is used as a Dependency lock file
    - https://developer.hashicorp.com/terraform/language/files/dependency-lock
- terraform.rc & .terraformrc, *.tfrc
    - https://developer.hashicorp.com/terraform/cli/config/config-file

PR #2412
2023-02-09 12:57:01 -05:00
Andrew Gallant
fe97c0a152 ignore-0.4.20 2023-01-15 08:21:02 -05:00
Christian Vallentin
826f3fad5b ignore/api: add Clone and Debug impls for OverrideBuilder
PR #2397
2023-01-15 08:16:27 -05:00
Andrew Gallant
bc55049327 readme: update MSRV in README
... this was apparently long outdated, wow.
2023-01-05 12:09:46 -05:00
Andrew Gallant
d58e9353fc deps: update to grep 0.2.11 2023-01-05 09:13:47 -05:00
Andrew Gallant
ca60fef4db grep-0.2.11 2023-01-05 09:12:49 -05:00
Andrew Gallant
a25307d6c8 deps: update to grep-printer 0.1.7 2023-01-05 09:12:37 -05:00
Andrew Gallant
b80947a8b3 grep-printer-0.1.7 2023-01-05 09:11:16 -05:00
Andrew Gallant
ad793a0d8f deps: update to grep-searcher 0.1.11 2023-01-05 09:07:49 -05:00
Andrew Gallant
120e55e7c7 grep-searcher-0.1.11 2023-01-05 09:07:09 -05:00
Andrew Gallant
3941a7701d deps: update to grep-pcre2 0.1.6 2023-01-05 09:06:52 -05:00
35 changed files with 1867 additions and 1574 deletions

View File

@@ -42,31 +42,31 @@ jobs:
- win-gnu
include:
- build: pinned
os: ubuntu-22.04
rust: 1.65.0
os: ubuntu-latest
rust: 1.70.0
- build: stable
os: ubuntu-22.04
os: ubuntu-latest
rust: stable
- build: beta
os: ubuntu-22.04
os: ubuntu-latest
rust: beta
- build: nightly
os: ubuntu-22.04
os: ubuntu-latest
rust: nightly
- build: nightly-musl
os: ubuntu-22.04
os: ubuntu-latest
rust: nightly
target: x86_64-unknown-linux-musl
- build: nightly-32
os: ubuntu-22.04
os: ubuntu-latest
rust: nightly
target: i686-unknown-linux-gnu
- build: nightly-mips
os: ubuntu-22.04
os: ubuntu-latest
rust: nightly
target: mips64-unknown-linux-gnuabi64
- build: nightly-arm
os: ubuntu-22.04
os: ubuntu-latest
rust: nightly
# For stripping release binaries:
# docker run --rm -v $PWD/target:/target:Z \
@@ -75,7 +75,7 @@ jobs:
# /target/arm-unknown-linux-gnueabihf/debug/rg
target: arm-unknown-linux-gnueabihf
- build: macos
os: macos-12
os: macos-latest
rust: nightly
- build: win-msvc
os: windows-2022
@@ -88,12 +88,12 @@ jobs:
uses: actions/checkout@v3
- name: Install packages (Ubuntu)
if: matrix.os == 'ubuntu-22.04'
if: matrix.os == 'ubuntu-latest'
run: |
ci/ubuntu-install-packages
- name: Install packages (macOS)
if: matrix.os == 'macos-12'
if: matrix.os == 'macos-latest'
run: |
ci/macos-install-packages
@@ -178,7 +178,7 @@ jobs:
rustfmt:
name: rustfmt
runs-on: ubuntu-22.04
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v3
@@ -192,7 +192,7 @@ jobs:
docs:
name: Docs
runs-on: ubuntu-22.04
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v3

187
Cargo.lock generated
View File

@@ -4,24 +4,13 @@ version = 3
[[package]]
name = "aho-corasick"
version = "0.7.20"
version = "1.0.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "cc936419f96fa211c1b9166887b38e5e40b19958e5b895be7c1f93adec7071ac"
checksum = "43f6cb1bf222025340178f382c426f13757b2960e89779dfcb319c32542a5a41"
dependencies = [
"memchr",
]
[[package]]
name = "atty"
version = "0.2.14"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d9b39be18770d11421cdb1b9947a45dd3f37e93092cbf377614828a319d5fee8"
dependencies = [
"hermit-abi",
"libc",
"winapi",
]
[[package]]
name = "base64"
version = "0.20.0"
@@ -36,12 +25,11 @@ checksum = "bef38d45163c2f1dde094a7dfd33ccf595c92905c8f8f4fdc18d06fb1037718a"
[[package]]
name = "bstr"
version = "1.1.0"
version = "1.6.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b45ea9b00a7b3f2988e9a65ad3917e62123c38dba709b666506207be96d1790b"
checksum = "6798148dccfbff0fae41c7574d2fa8f1ef3492fba0face179de5d8d447d67b05"
dependencies = [
"memchr",
"once_cell",
"regex-automata",
"serde",
]
@@ -54,9 +42,9 @@ checksum = "2c676a478f63e9fa2dd5368a42f28bba0d6c560b775f38583c8bbaa7fcd67c9c"
[[package]]
name = "cc"
version = "1.0.78"
version = "1.0.79"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a20104e2335ce8a659d6dd92a51a767a0c062599c73b343fd152cb401e828c3d"
checksum = "50d30906286121d95be3d479533b458f87493b30a4b5f79a607db8f5d11aa91f"
dependencies = [
"jobserver",
]
@@ -81,9 +69,9 @@ dependencies = [
[[package]]
name = "crossbeam-channel"
version = "0.5.6"
version = "0.5.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c2dd04ddaf88237dc3b8d8f9a3c1004b506b54b3313403944054d23c0870c521"
checksum = "a33c2bf77f2df06183c3aa30d1e96c0695a313d4f9c453cc3762a6db39f99200"
dependencies = [
"cfg-if",
"crossbeam-utils",
@@ -91,18 +79,18 @@ dependencies = [
[[package]]
name = "crossbeam-utils"
version = "0.8.14"
version = "0.8.16"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4fb766fa798726286dbbb842f174001dab8abc7b627a1dd86e0b7222a95d929f"
checksum = "5a22b2d63d4d1dc0b7f1b6b2747dd0088008a9be28b6ddf0b1e7d335e3037294"
dependencies = [
"cfg-if",
]
[[package]]
name = "encoding_rs"
version = "0.8.31"
version = "0.8.32"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9852635589dc9f9ea1b6fe9f05b50ef208c85c834a562f0c6abb1c475736ec2b"
checksum = "071a31f4ee85403370b58aca746f01041ede6f0da2730960ad001edc2b71b394"
dependencies = [
"cfg-if",
"packed_simd_2",
@@ -123,17 +111,11 @@ version = "1.0.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3f9eec918d3f24069decb9af1554cad7c880e2da24a9afd88aca000531ab82c1"
[[package]]
name = "fs_extra"
version = "1.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2022715d62ab30faffd124d40b76f4134a550a87792276512b18d63272333394"
[[package]]
name = "glob"
version = "0.3.0"
version = "0.3.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9b919933a397b79c37e33b77bb2aa3dc8eb6e165ad809e58ff75bc7db2e34574"
checksum = "d2fabcfbdc87f4758337ca535fb41a6d701b65693ce38287d856d1674551ec9b"
[[package]]
name = "globset"
@@ -152,7 +134,7 @@ dependencies = [
[[package]]
name = "grep"
version = "0.2.10"
version = "0.2.12"
dependencies = [
"grep-cli",
"grep-matcher",
@@ -166,9 +148,8 @@ dependencies = [
[[package]]
name = "grep-cli"
version = "0.1.7"
version = "0.1.8"
dependencies = [
"atty",
"bstr",
"globset",
"lazy_static",
@@ -192,12 +173,13 @@ name = "grep-pcre2"
version = "0.1.6"
dependencies = [
"grep-matcher",
"log",
"pcre2",
]
[[package]]
name = "grep-printer"
version = "0.1.6"
version = "0.1.7"
dependencies = [
"base64",
"bstr",
@@ -217,14 +199,13 @@ dependencies = [
"bstr",
"grep-matcher",
"log",
"regex",
"regex-automata",
"regex-syntax",
"thread_local",
]
[[package]]
name = "grep-searcher"
version = "0.1.10"
version = "0.1.11"
dependencies = [
"bstr",
"bytecount",
@@ -237,18 +218,9 @@ dependencies = [
"regex",
]
[[package]]
name = "hermit-abi"
version = "0.1.19"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "62b467343b94ba476dcb2500d242dadbb39557df889310ac77c5d99100aaac33"
dependencies = [
"libc",
]
[[package]]
name = "ignore"
version = "0.4.19"
version = "0.4.20"
dependencies = [
"crossbeam-channel",
"globset",
@@ -264,18 +236,17 @@ dependencies = [
[[package]]
name = "itoa"
version = "1.0.5"
version = "1.0.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fad582f4b9e86b6caa621cabeb0963332d92eea04729ab12892c2533951e6440"
checksum = "62b02a5381cc465bd3041d84623d0fa3b66738b52b8e2fc3bab8ad63ab032f4a"
[[package]]
name = "jemalloc-sys"
version = "0.5.2+5.3.0-patched"
version = "0.5.3+5.3.0-patched"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "134163979b6eed9564c98637b710b40979939ba351f59952708234ea11b5f3f8"
checksum = "f9bd5d616ea7ed58b571b2e209a65759664d7fb021a0819d7a790afc67e47ca1"
dependencies = [
"cc",
"fs_extra",
"libc",
]
@@ -291,9 +262,9 @@ dependencies = [
[[package]]
name = "jobserver"
version = "0.1.25"
version = "0.1.26"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "068b1ee6743e4d11fb9c6a1e6064b3693a1b600e7f5f5988047d98b3dc9fb90b"
checksum = "936cfd212a0155903bcbc060e316fb6cc7cbf2e1907329391ebadc1fe0ce77c2"
dependencies = [
"libc",
]
@@ -306,9 +277,9 @@ checksum = "e2abad23fbc42b3700f2f279844dc832adb2b2eb069b2df918f455c4e18cc646"
[[package]]
name = "libc"
version = "0.2.139"
version = "0.2.147"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "201de327520df007757c1f0adce6e827fe8562fbc28bfd9c15571c66ca1f5f79"
checksum = "b4668fb0ea861c1df094127ac5f1da3409a82116a4ba74fca2e58ef927159bb3"
[[package]]
name = "libm"
@@ -318,12 +289,9 @@ checksum = "7fc7aa29613bd6a620df431842069224d8bc9011086b1db4c0e0cd47fa03ec9a"
[[package]]
name = "log"
version = "0.4.17"
version = "0.4.19"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "abb12e687cfb44aa40f41fc3978ef76448f9b6038cad6aef4259d3c095a2382e"
dependencies = [
"cfg-if",
]
checksum = "b06a4cde4c0f271a446782e3eff8de789548ce57dbc8eca9292c27f4a42004b4"
[[package]]
name = "memchr"
@@ -333,18 +301,18 @@ checksum = "2dffe52ecf27772e601905b7522cb4ef790d2cc203488bbd0e2fe85fcb74566d"
[[package]]
name = "memmap2"
version = "0.5.8"
version = "0.5.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4b182332558b18d807c4ce1ca8ca983b34c3ee32765e47b3f0f69b90355cc1dc"
checksum = "83faa42c0a078c393f6b29d5db232d8be22776a891f8f56e5284faee4a20b327"
dependencies = [
"libc",
]
[[package]]
name = "once_cell"
version = "1.17.0"
version = "1.18.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6f61fba1741ea2b3d6a1e3178721804bb716a68a6aeba1149b5d52e3d464ea66"
checksum = "dd8b5dd2ae5ed71462c540258bedcb51965123ad7e7ccf4b9a8cafaa4a63576d"
[[package]]
name = "packed_simd_2"
@@ -358,9 +326,9 @@ dependencies = [
[[package]]
name = "pcre2"
version = "0.2.3"
version = "0.2.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "85b30f2f69903b439dd9dc9e824119b82a55bf113b29af8d70948a03c1b11ab1"
checksum = "486aca7e74edb8cab09a48d461177f450a5cca3b55e61d139f7552190e2bbcf5"
dependencies = [
"libc",
"log",
@@ -370,9 +338,9 @@ dependencies = [
[[package]]
name = "pcre2-sys"
version = "0.2.5"
version = "0.2.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "dec30e5e9ec37eb8fbf1dea5989bc957fd3df56fbee5061aa7b7a99dbb37b722"
checksum = "ae234f441970dbd52d4e29bee70f3b56ca83040081cb2b55b7df772b16e0b06e"
dependencies = [
"cc",
"libc",
@@ -381,50 +349,56 @@ dependencies = [
[[package]]
name = "pkg-config"
version = "0.3.26"
version = "0.3.27"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6ac9a59f73473f1b8d852421e59e64809f025994837ef743615c6d0c5b305160"
checksum = "26072860ba924cbfa98ea39c8c19b4dd6a4a25423dbdf219c1eca91aa0cf6964"
[[package]]
name = "proc-macro2"
version = "1.0.49"
version = "1.0.63"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "57a8eca9f9c4ffde41714334dee777596264c7825420f521abc92b5b5deb63a5"
checksum = "7b368fba921b0dce7e60f5e04ec15e565b3303972b42bcfde1d0713b881959eb"
dependencies = [
"unicode-ident",
]
[[package]]
name = "quote"
version = "1.0.23"
version = "1.0.29"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8856d8364d252a14d474036ea1358d63c9e6965c8e5c1885c18f73d70bff9c7b"
checksum = "573015e8ab27661678357f27dc26460738fd2b6c86e46f386fde94cb5d913105"
dependencies = [
"proc-macro2",
]
[[package]]
name = "regex"
version = "1.7.0"
version = "1.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e076559ef8e241f2ae3479e36f97bd5741c0330689e217ad51ce2c76808b868a"
checksum = "89089e897c013b3deb627116ae56a6955a72b8bed395c9526af31c9fe528b484"
dependencies = [
"aho-corasick",
"memchr",
"regex-automata",
"regex-syntax",
]
[[package]]
name = "regex-automata"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fa250384981ea14565685dea16a9ccc4d1c541a13f82b9c168572264d1df8c56"
dependencies = [
"aho-corasick",
"memchr",
"regex-syntax",
]
[[package]]
name = "regex-automata"
version = "0.1.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6c230d73fb8d8c1b9c0b3135c5142a8acee3a0558fb8db5cf1cb65f8d7862132"
[[package]]
name = "regex-syntax"
version = "0.6.28"
version = "0.7.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "456c603be3e8d448b072f410900c09faf164fbce2d480456f50eea6e25f9c848"
checksum = "2ab07dc67230e4a4718e70fd5c20055a4334b121f1f9db8fe63ef39ce9b8c846"
[[package]]
name = "ripgrep"
@@ -437,7 +411,6 @@ dependencies = [
"jemallocator",
"lazy_static",
"log",
"regex",
"serde",
"serde_derive",
"serde_json",
@@ -447,9 +420,9 @@ dependencies = [
[[package]]
name = "ryu"
version = "1.0.12"
version = "1.0.14"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7b4b9743ed687d4b4bcedf9ff5eaa7398495ae14e61cba0a295704edbc7decde"
checksum = "fe232bdf6be8c8de797b22184ee71118d63780ea42ac85b61d1baa6d3b782ae9"
[[package]]
name = "same-file"
@@ -462,18 +435,18 @@ dependencies = [
[[package]]
name = "serde"
version = "1.0.152"
version = "1.0.166"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bb7d1f0d3021d347a83e556fc4683dea2ea09d87bccdf88ff5c12545d89d5efb"
checksum = "d01b7404f9d441d3ad40e6a636a7782c377d2abdbe4fa2440e2edcc2f4f10db8"
dependencies = [
"serde_derive",
]
[[package]]
name = "serde_derive"
version = "1.0.152"
version = "1.0.166"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "af487d118eecd09402d70a5d72551860e788df87b464af30e5ea6a38c75c541e"
checksum = "5dd83d6dde2b6b2d466e14d9d1acce8816dedee94f735eac6395808b3483c6d6"
dependencies = [
"proc-macro2",
"quote",
@@ -482,9 +455,9 @@ dependencies = [
[[package]]
name = "serde_json"
version = "1.0.91"
version = "1.0.100"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "877c235533714907a8c2464236f5c4b2a17262ef1bd71f38f35ea592c8da6883"
checksum = "0f1e14e89be7aa4c4b78bdbdc9eb5bf8517829a600ae8eaa39a6e1d960b5185c"
dependencies = [
"itoa",
"ryu",
@@ -499,9 +472,9 @@ checksum = "8ea5119cdb4c55b55d432abb513a0429384878c15dde60cc77b1c99de1a95a6a"
[[package]]
name = "syn"
version = "1.0.107"
version = "2.0.23"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1f4064b5b16e03ae50984a5a8ed5d4f8803e6bc1fd170a3cda91a1be4b18e3f5"
checksum = "59fb7d6d8281a51045d62b8eb3a7d1ce347b76f312af50cd3dc0af39c87c1737"
dependencies = [
"proc-macro2",
"quote",
@@ -510,9 +483,9 @@ dependencies = [
[[package]]
name = "termcolor"
version = "1.1.3"
version = "1.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bab24d30b911b2376f3a13cc2cd443142f0c81dda04c118693e35b3835757755"
checksum = "be55cf8942feac5c765c2c993422806843c9a9a45d4d5c407ad6dd2ea95eb9b6"
dependencies = [
"winapi-util",
]
@@ -528,18 +501,19 @@ dependencies = [
[[package]]
name = "thread_local"
version = "1.1.4"
version = "1.1.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5516c27b78311c50bf42c071425c560ac799b11c30b31f87e3081965fe5e0180"
checksum = "3fdd6f064ccff2d6567adcb3873ca630700f00b5ad3f060c25b5dcfd9a4ce152"
dependencies = [
"cfg-if",
"once_cell",
]
[[package]]
name = "unicode-ident"
version = "1.0.6"
version = "1.0.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "84a22b9f218b40614adcb3f4ff08b703773ad44fa9423e4e0d346d5db86e4ebc"
checksum = "22049a19f4a68748a168c0fc439f9516686aa045927ff767eca0a85101fb6e73"
[[package]]
name = "unicode-width"
@@ -549,12 +523,11 @@ checksum = "c0edd1e5b14653f783770bce4a4dabb4a5108a5370a5f5d8cfe8710c361f6c8b"
[[package]]
name = "walkdir"
version = "2.3.2"
version = "2.3.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "808cf2735cd4b6866113f648b791c6adc5714537bc222d9347bb203386ffda56"
checksum = "36df944cda56c7d8d8b7496af378e6b16de9284591917d307c9b4d313c44e698"
dependencies = [
"same-file",
"winapi",
"winapi-util",
]

View File

@@ -42,12 +42,11 @@ members = [
]
[dependencies]
bstr = "1.1.0"
grep = { version = "0.2.8", path = "crates/grep" }
bstr = "1.6.0"
grep = { version = "0.2.12", path = "crates/grep" }
ignore = { version = "0.4.19", path = "crates/ignore" }
lazy_static = "1.1.0"
log = "0.4.5"
regex = "1.3.5"
serde_json = "1.0.23"
termcolor = "1.1.0"

View File

@@ -6,6 +6,7 @@ image = "burntsushi/cross:i686-unknown-linux-gnu"
[target.mips64-unknown-linux-gnuabi64]
image = "burntsushi/cross:mips64-unknown-linux-gnuabi64"
build-std = true
[target.arm-unknown-linux-gnueabihf]
image = "burntsushi/cross:arm-unknown-linux-gnueabihf"

View File

@@ -287,8 +287,10 @@ $ curl -LO https://github.com/BurntSushi/ripgrep/releases/download/13.0.0/ripgre
$ sudo dpkg -i ripgrep_13.0.0_amd64.deb
```
If you run Debian Buster (currently Debian stable) or Debian sid, ripgrep is
[officially maintained by Debian](https://tracker.debian.org/pkg/rust-ripgrep).
If you run Debian stable, ripgrep is [officially maintained by
Debian](https://tracker.debian.org/pkg/rust-ripgrep), although its version may
be older than the `deb` package available in the previous step.
```
$ sudo apt-get install ripgrep
```
@@ -343,7 +345,7 @@ $ pkgman install ripgrep_x86
If you're a **Rust programmer**, ripgrep can be installed with `cargo`.
* Note that the minimum supported version of Rust for ripgrep is **1.34.0**,
* Note that the minimum supported version of Rust for ripgrep is **1.70.0**,
although ripgrep may work with older versions.
* Note that the binary may be bigger than expected because it contains debug
symbols. This is intentional. To remove debug symbols and therefore reduce
@@ -430,6 +432,14 @@ $ cargo test --all
from the repository root.
### Related tools
* [delta](https://github.com/dandavison/delta) is a syntax highlighting
pager that supports the `rg --json` output format. So all you need to do to
make it work is `rg --json pattern | delta`. See [delta's manual section on
grep](https://dandavison.github.io/delta/grep.html) for more details.
### Vulnerability reporting
For reporting a security vulnerability, please

View File

@@ -1,6 +1,6 @@
[package]
name = "grep-cli"
version = "0.1.7" #:version
version = "0.1.8" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
Utilities for search oriented command line applications.
@@ -14,8 +14,7 @@ license = "Unlicense OR MIT"
edition = "2018"
[dependencies]
atty = "0.2.11"
bstr = "1.1.0"
bstr = "1.6.0"
globset = { version = "0.4.10", path = "../globset" }
lazy_static = "1.1.0"
log = "0.4.5"

View File

@@ -165,6 +165,8 @@ mod pattern;
mod process;
mod wtr;
use std::io::IsTerminal;
pub use crate::decompress::{
resolve_binary, DecompressionMatcher, DecompressionMatcherBuilder,
DecompressionReader, DecompressionReaderBuilder,
@@ -215,7 +217,7 @@ pub fn is_readable_stdin() -> bool {
/// Returns true if and only if stdin is believed to be connected to a tty
/// or a console.
pub fn is_tty_stdin() -> bool {
atty::is(atty::Stream::Stdin)
std::io::stdin().is_terminal()
}
/// Returns true if and only if stdout is believed to be connected to a tty
@@ -227,11 +229,11 @@ pub fn is_tty_stdin() -> bool {
/// implementations of `ls` will often show one item per line when stdout is
/// redirected, but will condensed output when printing to a tty.
pub fn is_tty_stdout() -> bool {
atty::is(atty::Stream::Stdout)
std::io::stdout().is_terminal()
}
/// Returns true if and only if stderr is believed to be connected to a tty
/// or a console.
pub fn is_tty_stderr() -> bool {
atty::is(atty::Stream::Stderr)
std::io::stderr().is_terminal()
}

View File

@@ -2583,8 +2583,8 @@ Do not print anything to stdout. If a match is found in a file, then ripgrep
will stop searching. This is useful when ripgrep is used only for its exit
code (which will be an error if no matches are found).
When --files is used, then ripgrep will stop finding files after finding the
first file that matches all ignore rules.
When --files is used, ripgrep will stop finding files after finding the
first file that does not match any ignore rules.
"
);
let arg = RGArg::switch("quiet").short("q").help(SHORT).long_help(LONG);

View File

@@ -31,7 +31,6 @@ use ignore::overrides::{Override, OverrideBuilder};
use ignore::types::{FileTypeDef, Types, TypesBuilder};
use ignore::{Walk, WalkBuilder, WalkParallel};
use log;
use regex;
use termcolor::{BufferWriter, ColorChoice, WriteColor};
use crate::app;
@@ -472,24 +471,6 @@ enum EncodingMode {
Disabled,
}
impl EncodingMode {
/// Checks if an explicit encoding has been set. Returns false for
/// automatic BOM sniffing and no sniffing.
///
/// This is only used to determine whether PCRE2 needs to have its own
/// UTF-8 checking enabled. If we have an explicit encoding set, then
/// we're always guaranteed to get UTF-8, so we can disable PCRE2's check.
/// Otherwise, we have no such guarantee, and must enable PCRE2' UTF-8
/// check.
#[cfg(feature = "pcre2")]
fn has_explicit_encoding(&self) -> bool {
match self {
EncodingMode::Some(_) => true,
_ => false,
}
}
}
impl ArgMatches {
/// Create an ArgMatches from clap's parse result.
fn new(clap_matches: clap::ArgMatches<'static>) -> ArgMatches {
@@ -671,6 +652,8 @@ impl ArgMatches {
.multi_line(true)
.unicode(self.unicode())
.octal(false)
.fixed_strings(self.is_present("fixed-strings"))
.whole_line(self.is_present("line-regexp"))
.word(self.is_present("word-regexp"));
if self.is_present("multiline") {
builder.dot_matches_new_line(self.is_present("multiline-dotall"));
@@ -697,12 +680,7 @@ impl ArgMatches {
if let Some(limit) = self.dfa_size_limit()? {
builder.dfa_size_limit(limit);
}
let res = if self.is_present("fixed-strings") {
builder.build_literals(patterns)
} else {
builder.build(&patterns.join("|"))
};
match res {
match builder.build_many(patterns) {
Ok(m) => Ok(m),
Err(err) => Err(From::from(suggest_multiline(err.to_string()))),
}
@@ -719,6 +697,8 @@ impl ArgMatches {
.case_smart(self.case_smart())
.caseless(self.case_insensitive())
.multi_line(true)
.fixed_strings(self.is_present("fixed-strings"))
.whole_line(self.is_present("line-regexp"))
.word(self.is_present("word-regexp"));
// For whatever reason, the JIT craps out during regex compilation with
// a "no more memory" error on 32 bit systems. So don't use it there.
@@ -732,14 +712,6 @@ impl ArgMatches {
}
if self.unicode() {
builder.utf(true).ucp(true);
if self.encoding()?.has_explicit_encoding() {
// SAFETY: If an encoding was specified, then we're guaranteed
// to get valid UTF-8, so we can disable PCRE2's UTF checking.
// (Feeding invalid UTF-8 to PCRE2 is undefined behavior.)
unsafe {
builder.disable_utf_check();
}
}
}
if self.is_present("multiline") {
builder.dotall(self.is_present("multiline-dotall"));
@@ -747,7 +719,7 @@ impl ArgMatches {
if self.is_present("crlf") {
builder.crlf(true);
}
Ok(builder.build(&patterns.join("|"))?)
Ok(builder.build_many(patterns)?)
}
/// Build a JSON printer that writes results to the given writer.
@@ -1080,7 +1052,6 @@ impl ArgMatches {
}
let label = match self.value_of_lossy("encoding") {
None if self.pcre2_unicode() => "utf-8".to_string(),
None => return Ok(EncodingMode::Auto),
Some(label) => label,
};
@@ -1412,11 +1383,6 @@ impl ArgMatches {
/// Get a sequence of all available patterns from the command line.
/// This includes reading the -e/--regexp and -f/--file flags.
///
/// Note that if -F/--fixed-strings is set, then all patterns will be
/// escaped. If -x/--line-regexp is set, then all patterns are surrounded
/// by `^...$`. Other things, such as --word-regexp, are handled by the
/// regex matcher itself.
///
/// If any pattern is invalid UTF-8, then an error is returned.
fn patterns(&self) -> Result<Vec<String>> {
if self.is_present("files") || self.is_present("type-list") {
@@ -1457,16 +1423,6 @@ impl ArgMatches {
Ok(pats)
}
/// Returns a pattern that is guaranteed to produce an empty regular
/// expression that is valid in any position.
fn pattern_empty(&self) -> String {
// This would normally just be an empty string, which works on its
// own, but if the patterns are joined in a set of alternations, then
// you wind up with `foo|`, which is currently invalid in Rust's regex
// engine.
"(?:z{0})*".to_string()
}
/// Converts an OsStr pattern to a String pattern. The pattern is escaped
/// if -F/--fixed-strings is set.
///
@@ -1485,30 +1441,12 @@ impl ArgMatches {
/// Applies additional processing on the given pattern if necessary
/// (such as escaping meta characters or turning it into a line regex).
fn pattern_from_string(&self, pat: String) -> String {
let pat = self.pattern_line(self.pattern_literal(pat));
if pat.is_empty() {
self.pattern_empty()
} else {
pat
}
}
/// Returns the given pattern as a line pattern if the -x/--line-regexp
/// flag is set. Otherwise, the pattern is returned unchanged.
fn pattern_line(&self, pat: String) -> String {
if self.is_present("line-regexp") {
format!(r"^(?:{})$", pat)
} else {
pat
}
}
/// Returns the given pattern as a literal pattern if the
/// -F/--fixed-strings flag is set. Otherwise, the pattern is returned
/// unchanged.
fn pattern_literal(&self, pat: String) -> String {
if self.is_present("fixed-strings") {
regex::escape(&pat)
// This would normally just be an empty string, which works on its
// own, but if the patterns are joined in a set of alternations,
// then you wind up with `foo|`, which is currently invalid in
// Rust's regex engine.
"(?:)".to_string()
} else {
pat
}
@@ -1641,12 +1579,6 @@ impl ArgMatches {
!(self.is_present("no-unicode") || self.is_present("no-pcre2-unicode"))
}
/// Returns true if and only if PCRE2 is enabled and its Unicode mode is
/// enabled.
fn pcre2_unicode(&self) -> bool {
self.is_present("pcre2") && self.unicode()
}
/// Returns true if and only if file names containing each match should
/// be emitted.
fn with_filename(&self, paths: &[PathBuf]) -> bool {

View File

@@ -20,11 +20,11 @@ name = "globset"
bench = false
[dependencies]
aho-corasick = "0.7.3"
bstr = { version = "1.1.0", default-features = false, features = ["std"] }
aho-corasick = "1.0.2"
bstr = { version = "1.6.0", default-features = false, features = ["std"] }
fnv = "1.0.6"
log = { version = "0.4.5", optional = true }
regex = { version = "1.1.5", default-features = false, features = ["perf", "std"] }
regex = { version = "1.8.3", default-features = false, features = ["perf", "std"] }
serde = { version = "1.0.104", optional = true }
[dev-dependencies]

View File

@@ -498,13 +498,23 @@ impl GlobSetBuilder {
/// Constructing candidates has a very small cost associated with it, so
/// callers may find it beneficial to amortize that cost when matching a single
/// path against multiple globs or sets of globs.
#[derive(Clone, Debug)]
#[derive(Clone)]
pub struct Candidate<'a> {
path: Cow<'a, [u8]>,
basename: Cow<'a, [u8]>,
ext: Cow<'a, [u8]>,
}
impl<'a> std::fmt::Debug for Candidate<'a> {
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
f.debug_struct("Candidate")
.field("path", &self.path.as_bstr())
.field("basename", &self.basename.as_bstr())
.field("ext", &self.ext.as_bstr())
.finish()
}
}
impl<'a> Candidate<'a> {
/// Create a new candidate for matching from the given path.
pub fn new<P: AsRef<Path> + ?Sized>(path: &'a P) -> Candidate<'a> {
@@ -818,7 +828,7 @@ impl MultiStrategyBuilder {
fn prefix(self) -> PrefixStrategy {
PrefixStrategy {
matcher: AhoCorasick::new_auto_configured(&self.literals),
matcher: AhoCorasick::new(&self.literals).unwrap(),
map: self.map,
longest: self.longest,
}
@@ -826,7 +836,7 @@ impl MultiStrategyBuilder {
fn suffix(self) -> SuffixStrategy {
SuffixStrategy {
matcher: AhoCorasick::new_auto_configured(&self.literals),
matcher: AhoCorasick::new(&self.literals).unwrap(),
map: self.map,
longest: self.longest,
}

View File

@@ -23,7 +23,7 @@ impl<'de> Deserialize<'de> for Glob {
#[cfg(test)]
mod tests {
use Glob;
use crate::Glob;
#[test]
fn glob_json_works() {

View File

@@ -1,6 +1,6 @@
[package]
name = "grep"
version = "0.2.10" #:version
version = "0.2.12" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
Fast line oriented regex searching as a library.
@@ -16,10 +16,10 @@ edition = "2018"
[dependencies]
grep-cli = { version = "0.1.7", path = "../cli" }
grep-matcher = { version = "0.1.6", path = "../matcher" }
grep-pcre2 = { version = "0.1.5", path = "../pcre2", optional = true }
grep-printer = { version = "0.1.6", path = "../printer" }
grep-pcre2 = { version = "0.1.6", path = "../pcre2", optional = true }
grep-printer = { version = "0.1.7", path = "../printer" }
grep-regex = { version = "0.1.11", path = "../regex" }
grep-searcher = { version = "0.1.10", path = "../searcher" }
grep-searcher = { version = "0.1.11", path = "../searcher" }
[dev-dependencies]
termcolor = "1.0.4"

View File

@@ -12,8 +12,6 @@ are sparse.
A cookbook and a guide are planned.
*/
#![deny(missing_docs)]
pub extern crate grep_cli as cli;
pub extern crate grep_matcher as matcher;
#[cfg(feature = "pcre2")]

View File

@@ -1,6 +1,6 @@
[package]
name = "ignore"
version = "0.4.19" #:version
version = "0.4.20" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
A fast library for efficiently matching ignore files such as `.gitignore`
@@ -22,8 +22,8 @@ bench = false
globset = { version = "0.4.10", path = "../globset" }
lazy_static = "1.1"
log = "0.4.5"
memchr = "2.1"
regex = "1.1"
memchr = "2.5"
regex = "1.8.3"
same-file = "1.0.4"
thread_local = "1"
walkdir = "2.2.7"

View File

@@ -60,6 +60,7 @@ pub const DEFAULT_TYPES: &[(&str, &[&str])] = &[
("dhall", &["*.dhall"]),
("diff", &["*.patch", "*.diff"]),
("docker", &["*Dockerfile*"]),
("dockercompose", &["docker-compose.yml", "docker-compose.*.yml"]),
("dts", &["*.dts", "*.dtsi"]),
("dvc", &["Dvcfile", "*.dvc"]),
("ebuild", &["*.ebuild"]),
@@ -149,9 +150,9 @@ pub const DEFAULT_TYPES: &[(&str, &[&str])] = &[
]),
("mako", &["*.mako", "*.mao"]),
("man", &["*.[0-9lnpx]", "*.[0-9][cEFMmpSx]"]),
("markdown", &["*.markdown", "*.md", "*.mdown", "*.mkd", "*.mkdn"]),
("markdown", &["*.markdown", "*.md", "*.mdown", "*.mdwn", "*.mkd", "*.mkdn"]),
("matlab", &["*.m"]),
("md", &["*.markdown", "*.md", "*.mdown", "*.mkd", "*.mkdn"]),
("md", &["*.markdown", "*.md", "*.mdown", "*.mdwn", "*.mkd", "*.mkdn"]),
("meson", &["meson.build", "meson_options.txt"]),
("minified", &["*.min.html", "*.min.css", "*.min.js"]),
("mint", &["*.mint"]),
@@ -160,6 +161,7 @@ pub const DEFAULT_TYPES: &[(&str, &[&str])] = &[
("motoko", &["*.mo"]),
("msbuild", &[
"*.csproj", "*.fsproj", "*.vcxproj", "*.proj", "*.props", "*.targets",
"*.sln",
]),
("nim", &["*.nim", "*.nimf", "*.nimble", "*.nims"]),
("nix", &["*.nix"]),
@@ -184,7 +186,7 @@ pub const DEFAULT_TYPES: &[(&str, &[&str])] = &[
("ps", &["*.cdxml", "*.ps1", "*.ps1xml", "*.psd1", "*.psm1"]),
("puppet", &["*.epp", "*.erb", "*.pp", "*.rb"]),
("purs", &["*.purs"]),
("py", &["*.py"]),
("py", &["*.py", "*.pyi"]),
("qmake", &["*.pro", "*.pri", "*.prf"]),
("qml", &["*.qml"]),
("r", &["*.R", "*.r", "*.Rmd", "*.Rnw"]),
@@ -251,7 +253,11 @@ pub const DEFAULT_TYPES: &[(&str, &[&str])] = &[
("tex", &["*.tex", "*.ltx", "*.cls", "*.sty", "*.bib", "*.dtx", "*.ins"]),
("texinfo", &["*.texi"]),
("textile", &["*.textile"]),
("tf", &["*.tf"]),
("tf", &[
"*.tf", "*.auto.tfvars", "terraform.tfvars", "*.tf.json",
"*.auto.tfvars.json", "terraform.tfvars.json", "*.terraformrc",
"terraform.rc", "*.tfrc", "*.terraform.lock.hcl",
]),
("thrift", &["*.thrift"]),
("toml", &["*.toml", "Cargo.lock"]),
("ts", &["*.ts", "*.tsx", "*.cts", "*.mts"]),

View File

@@ -106,6 +106,7 @@ impl Override {
}
/// Builds a matcher for a set of glob overrides.
#[derive(Clone, Debug)]
pub struct OverrideBuilder {
builder: GitignoreBuilder,
}

View File

@@ -15,4 +15,5 @@ edition = "2018"
[dependencies]
grep-matcher = { version = "0.1.6", path = "../matcher" }
pcre2 = "0.2.3"
log = "0.4.19"
pcre2 = "0.2.4"

View File

@@ -11,6 +11,8 @@ pub struct RegexMatcherBuilder {
builder: RegexBuilder,
case_smart: bool,
word: bool,
fixed_strings: bool,
whole_line: bool,
}
impl RegexMatcherBuilder {
@@ -20,6 +22,8 @@ impl RegexMatcherBuilder {
builder: RegexBuilder::new(),
case_smart: false,
word: false,
fixed_strings: false,
whole_line: false,
}
}
@@ -29,17 +33,40 @@ impl RegexMatcherBuilder {
/// If there was a problem compiling the pattern, then an error is
/// returned.
pub fn build(&self, pattern: &str) -> Result<RegexMatcher, Error> {
self.build_many(&[pattern])
}
/// Compile all of the given patterns into a single regex that matches when
/// at least one of the patterns matches.
///
/// If there was a problem building the regex, then an error is returned.
pub fn build_many<P: AsRef<str>>(
&self,
patterns: &[P],
) -> Result<RegexMatcher, Error> {
let mut builder = self.builder.clone();
if self.case_smart && !has_uppercase_literal(pattern) {
let mut pats = Vec::with_capacity(patterns.len());
for p in patterns.iter() {
pats.push(if self.fixed_strings {
format!("(?:{})", pcre2::escape(p.as_ref()))
} else {
format!("(?:{})", p.as_ref())
});
}
let mut singlepat = pats.join("|");
if self.case_smart && !has_uppercase_literal(&singlepat) {
builder.caseless(true);
}
let res = if self.word {
let pattern = format!(r"(?<!\w)(?:{})(?!\w)", pattern);
builder.build(&pattern)
} else {
builder.build(pattern)
};
res.map_err(Error::regex).map(|regex| {
if self.whole_line {
singlepat = format!(r"(?m:^)(?:{})(?m:$)", singlepat);
} else if self.word {
// We make this option exclusive with whole_line because when
// whole_line is enabled, all matches necessary fall on word
// boundaries. So this extra goop is strictly redundant.
singlepat = format!(r"(?<!\w)(?:{})(?!\w)", singlepat);
}
log::trace!("final regex: {:?}", singlepat);
builder.build(&singlepat).map_err(Error::regex).map(|regex| {
let mut names = HashMap::new();
for (i, name) in regex.capture_names().iter().enumerate() {
if let Some(ref name) = *name {
@@ -144,6 +171,21 @@ impl RegexMatcherBuilder {
self
}
/// Whether the patterns should be treated as literal strings or not. When
/// this is active, all characters, including ones that would normally be
/// special regex meta characters, are matched literally.
pub fn fixed_strings(&mut self, yes: bool) -> &mut RegexMatcherBuilder {
self.fixed_strings = yes;
self
}
/// Whether each pattern should match the entire line or not. This is
/// equivalent to surrounding the pattern with `(?m:^)` and `(?m:$)`.
pub fn whole_line(&mut self, yes: bool) -> &mut RegexMatcherBuilder {
self.whole_line = yes;
self
}
/// Enable Unicode matching mode.
///
/// When enabled, the following patterns become Unicode aware: `\b`, `\B`,
@@ -178,23 +220,22 @@ impl RegexMatcherBuilder {
self
}
/// When UTF matching mode is enabled, this will disable the UTF checking
/// that PCRE2 will normally perform automatically. If UTF matching mode
/// is not enabled, then this has no effect.
/// This is now deprecated and is a no-op.
///
/// UTF checking is enabled by default when UTF matching mode is enabled.
/// If UTF matching mode is enabled and UTF checking is enabled, then PCRE2
/// will return an error if you attempt to search a subject string that is
/// not valid UTF-8.
/// Previously, this option permitted disabling PCRE2's UTF-8 validity
/// check, which could result in undefined behavior if the haystack was
/// not valid UTF-8. But PCRE2 introduced a new option, `PCRE2_MATCH_INVALID_UTF`,
/// in 10.34 which this crate always sets. When this option is enabled,
/// PCRE2 claims to not have undefined behavior when the haystack is
/// invalid UTF-8.
///
/// # Safety
///
/// It is undefined behavior to disable the UTF check in UTF matching mode
/// and search a subject string that is not valid UTF-8. When the UTF check
/// is disabled, callers must guarantee that the subject string is valid
/// UTF-8.
pub unsafe fn disable_utf_check(&mut self) -> &mut RegexMatcherBuilder {
self.builder.disable_utf_check();
/// Therefore, disabling the UTF-8 check is not something that is exposed
/// by this crate.
#[deprecated(
since = "0.2.4",
note = "now a no-op due to new PCRE2 features"
)]
pub fn disable_utf_check(&mut self) -> &mut RegexMatcherBuilder {
self
}

View File

@@ -1,6 +1,6 @@
[package]
name = "grep-printer"
version = "0.1.6" #:version
version = "0.1.7" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
An implementation of the grep crate's Sink trait that provides standard
@@ -20,9 +20,9 @@ serde1 = ["base64", "serde", "serde_json"]
[dependencies]
base64 = { version = "0.20.0", optional = true }
bstr = "1.1.0"
bstr = "1.6.0"
grep-matcher = { version = "0.1.6", path = "../matcher" }
grep-searcher = { version = "0.1.8", path = "../searcher" }
grep-searcher = { version = "0.1.11", path = "../searcher" }
termcolor = "1.0.4"
serde = { version = "1.0.77", optional = true, features = ["derive"] }
serde_json = { version = "1.0.27", optional = true }

View File

@@ -11,13 +11,12 @@ repository = "https://github.com/BurntSushi/ripgrep/tree/master/crates/regex"
readme = "README.md"
keywords = ["regex", "grep", "search", "pattern", "line"]
license = "Unlicense OR MIT"
edition = "2018"
edition = "2021"
[dependencies]
aho-corasick = "0.7.3"
bstr = "1.1.0"
aho-corasick = "1.0.2"
bstr = "1.6.0"
grep-matcher = { version = "0.1.6", path = "../matcher" }
log = "0.4.5"
regex = "1.1"
regex-syntax = "0.6.5"
thread_local = "1.1.2"
log = "0.4.19"
regex-automata = { version = "0.3.0" }
regex-syntax = "0.7.2"

View File

@@ -1,17 +1,13 @@
use regex_syntax::ast::parse::Parser;
use regex_syntax::ast::{self, Ast};
/// The results of analyzing AST of a regular expression (e.g., for supporting
/// smart case).
#[derive(Clone, Debug)]
pub struct AstAnalysis {
pub(crate) struct AstAnalysis {
/// True if and only if a literal uppercase character occurs in the regex.
any_uppercase: bool,
/// True if and only if the regex contains any literal at all.
any_literal: bool,
/// True if and only if the regex consists entirely of a literal and no
/// other special regex characters.
all_verbatim_literal: bool,
}
impl AstAnalysis {
@@ -19,16 +15,16 @@ impl AstAnalysis {
///
/// If `pattern` is not a valid regular expression, then `None` is
/// returned.
#[allow(dead_code)]
pub fn from_pattern(pattern: &str) -> Option<AstAnalysis> {
Parser::new()
#[cfg(test)]
pub(crate) fn from_pattern(pattern: &str) -> Option<AstAnalysis> {
regex_syntax::ast::parse::Parser::new()
.parse(pattern)
.map(|ast| AstAnalysis::from_ast(&ast))
.ok()
}
/// Perform an AST analysis given the AST.
pub fn from_ast(ast: &Ast) -> AstAnalysis {
pub(crate) fn from_ast(ast: &Ast) -> AstAnalysis {
let mut analysis = AstAnalysis::new();
analysis.from_ast_impl(ast);
analysis
@@ -40,7 +36,7 @@ impl AstAnalysis {
/// For example, a pattern like `\pL` contains no uppercase literals,
/// even though `L` is uppercase and the `\pL` class contains uppercase
/// characters.
pub fn any_uppercase(&self) -> bool {
pub(crate) fn any_uppercase(&self) -> bool {
self.any_uppercase
}
@@ -48,32 +44,13 @@ impl AstAnalysis {
///
/// For example, a pattern like `\pL` reports `false`, but a pattern like
/// `\pLfoo` reports `true`.
pub fn any_literal(&self) -> bool {
pub(crate) fn any_literal(&self) -> bool {
self.any_literal
}
/// Returns true if and only if the entire pattern is a verbatim literal
/// with no special meta characters.
///
/// When this is true, then the pattern satisfies the following law:
/// `escape(pattern) == pattern`. Notable examples where this returns
/// `false` include patterns like `a\u0061` even though `\u0061` is just
/// a literal `a`.
///
/// The purpose of this flag is to determine whether the patterns can be
/// given to non-regex substring search algorithms as-is.
#[allow(dead_code)]
pub fn all_verbatim_literal(&self) -> bool {
self.all_verbatim_literal
}
/// Creates a new `AstAnalysis` value with an initial configuration.
fn new() -> AstAnalysis {
AstAnalysis {
any_uppercase: false,
any_literal: false,
all_verbatim_literal: true,
}
AstAnalysis { any_uppercase: false, any_literal: false }
}
fn from_ast_impl(&mut self, ast: &Ast) {
@@ -86,26 +63,20 @@ impl AstAnalysis {
| Ast::Dot(_)
| Ast::Assertion(_)
| Ast::Class(ast::Class::Unicode(_))
| Ast::Class(ast::Class::Perl(_)) => {
self.all_verbatim_literal = false;
}
| Ast::Class(ast::Class::Perl(_)) => {}
Ast::Literal(ref x) => {
self.from_ast_literal(x);
}
Ast::Class(ast::Class::Bracketed(ref x)) => {
self.all_verbatim_literal = false;
self.from_ast_class_set(&x.kind);
}
Ast::Repetition(ref x) => {
self.all_verbatim_literal = false;
self.from_ast_impl(&x.ast);
}
Ast::Group(ref x) => {
self.all_verbatim_literal = false;
self.from_ast_impl(&x.ast);
}
Ast::Alternation(ref alt) => {
self.all_verbatim_literal = false;
for x in &alt.asts {
self.from_ast_impl(x);
}
@@ -161,9 +132,6 @@ impl AstAnalysis {
}
fn from_ast_literal(&mut self, ast: &ast::Literal) {
if ast.kind != ast::LiteralKind::Verbatim {
self.all_verbatim_literal = false;
}
self.any_literal = true;
self.any_uppercase = self.any_uppercase || ast.c.is_uppercase();
}
@@ -171,7 +139,7 @@ impl AstAnalysis {
/// Returns true if and only if the attributes can never change no matter
/// what other AST it might see.
fn done(&self) -> bool {
self.any_uppercase && self.any_literal && !self.all_verbatim_literal
self.any_uppercase && self.any_literal
}
}
@@ -188,76 +156,61 @@ mod tests {
let x = analysis("");
assert!(!x.any_uppercase);
assert!(!x.any_literal);
assert!(x.all_verbatim_literal);
let x = analysis("foo");
assert!(!x.any_uppercase);
assert!(x.any_literal);
assert!(x.all_verbatim_literal);
let x = analysis("Foo");
assert!(x.any_uppercase);
assert!(x.any_literal);
assert!(x.all_verbatim_literal);
let x = analysis("foO");
assert!(x.any_uppercase);
assert!(x.any_literal);
assert!(x.all_verbatim_literal);
let x = analysis(r"foo\\");
assert!(!x.any_uppercase);
assert!(x.any_literal);
assert!(!x.all_verbatim_literal);
let x = analysis(r"foo\w");
assert!(!x.any_uppercase);
assert!(x.any_literal);
assert!(!x.all_verbatim_literal);
let x = analysis(r"foo\S");
assert!(!x.any_uppercase);
assert!(x.any_literal);
assert!(!x.all_verbatim_literal);
let x = analysis(r"foo\p{Ll}");
assert!(!x.any_uppercase);
assert!(x.any_literal);
assert!(!x.all_verbatim_literal);
let x = analysis(r"foo[a-z]");
assert!(!x.any_uppercase);
assert!(x.any_literal);
assert!(!x.all_verbatim_literal);
let x = analysis(r"foo[A-Z]");
assert!(x.any_uppercase);
assert!(x.any_literal);
assert!(!x.all_verbatim_literal);
let x = analysis(r"foo[\S\t]");
assert!(!x.any_uppercase);
assert!(x.any_literal);
assert!(!x.all_verbatim_literal);
let x = analysis(r"foo\\S");
assert!(x.any_uppercase);
assert!(x.any_literal);
assert!(!x.all_verbatim_literal);
let x = analysis(r"\p{Ll}");
assert!(!x.any_uppercase);
assert!(!x.any_literal);
assert!(!x.all_verbatim_literal);
let x = analysis(r"aBc\w");
assert!(x.any_uppercase);
assert!(x.any_literal);
assert!(!x.all_verbatim_literal);
let x = analysis(r"a\u0061");
assert!(!x.any_uppercase);
assert!(x.any_literal);
assert!(!x.all_verbatim_literal);
}
}

View File

@@ -1,15 +1,16 @@
use grep_matcher::{ByteSet, LineTerminator};
use regex::bytes::{Regex, RegexBuilder};
use regex_syntax::ast::{self, Ast};
use regex_syntax::hir::{self, Hir};
use {
grep_matcher::{ByteSet, LineTerminator},
regex_automata::meta::Regex,
regex_syntax::{
ast,
hir::{self, Hir, HirKind},
},
};
use crate::ast::AstAnalysis;
use crate::crlf::crlfify;
use crate::error::Error;
use crate::literal::LiteralSets;
use crate::multi::alternation_literals;
use crate::non_matching::non_matching_bytes;
use crate::strip::strip_from_match;
use crate::{
ast::AstAnalysis, error::Error, non_matching::non_matching_bytes,
strip::strip_from_match,
};
/// Config represents the configuration of a regex matcher in this crate.
/// The configuration is itself a rough combination of the knobs found in
@@ -21,21 +22,23 @@ use crate::strip::strip_from_match;
/// configuration which generated it, and provides transformation on that HIR
/// such that the configuration is preserved.
#[derive(Clone, Debug)]
pub struct Config {
pub case_insensitive: bool,
pub case_smart: bool,
pub multi_line: bool,
pub dot_matches_new_line: bool,
pub swap_greed: bool,
pub ignore_whitespace: bool,
pub unicode: bool,
pub octal: bool,
pub size_limit: usize,
pub dfa_size_limit: usize,
pub nest_limit: u32,
pub line_terminator: Option<LineTerminator>,
pub crlf: bool,
pub word: bool,
pub(crate) struct Config {
pub(crate) case_insensitive: bool,
pub(crate) case_smart: bool,
pub(crate) multi_line: bool,
pub(crate) dot_matches_new_line: bool,
pub(crate) swap_greed: bool,
pub(crate) ignore_whitespace: bool,
pub(crate) unicode: bool,
pub(crate) octal: bool,
pub(crate) size_limit: usize,
pub(crate) dfa_size_limit: usize,
pub(crate) nest_limit: u32,
pub(crate) line_terminator: Option<LineTerminator>,
pub(crate) crlf: bool,
pub(crate) word: bool,
pub(crate) fixed_strings: bool,
pub(crate) whole_line: bool,
}
impl Default for Config {
@@ -50,47 +53,28 @@ impl Default for Config {
unicode: true,
octal: false,
// These size limits are much bigger than what's in the regex
// crate.
// crate by default.
size_limit: 100 * (1 << 20),
dfa_size_limit: 1000 * (1 << 20),
nest_limit: 250,
line_terminator: None,
crlf: false,
word: false,
fixed_strings: false,
whole_line: false,
}
}
}
impl Config {
/// Parse the given pattern and returned its HIR expression along with
/// the current configuration.
///
/// If there was a problem parsing the given expression then an error
/// is returned.
pub fn hir(&self, pattern: &str) -> Result<ConfiguredHIR, Error> {
let ast = self.ast(pattern)?;
let analysis = self.analysis(&ast)?;
let expr = hir::translate::TranslatorBuilder::new()
.allow_invalid_utf8(true)
.case_insensitive(self.is_case_insensitive(&analysis))
.multi_line(self.multi_line)
.dot_matches_new_line(self.dot_matches_new_line)
.swap_greed(self.swap_greed)
.unicode(self.unicode)
.build()
.translate(pattern, &ast)
.map_err(Error::regex)?;
let expr = match self.line_terminator {
None => expr,
Some(line_term) => strip_from_match(expr, line_term)?,
};
Ok(ConfiguredHIR {
original: pattern.to_string(),
config: self.clone(),
analysis,
// If CRLF mode is enabled, replace `$` with `(?:\r?$)`.
expr: if self.crlf { crlfify(expr) } else { expr },
})
/// Use this configuration to build an HIR from the given patterns. The HIR
/// returned corresponds to a single regex that is an alternation of the
/// patterns given.
pub(crate) fn build_many<P: AsRef<str>>(
&self,
patterns: &[P],
) -> Result<ConfiguredHIR, Error> {
ConfiguredHIR::new(self.clone(), patterns)
}
/// Accounting for the `smart_case` config knob, return true if and only if
@@ -105,35 +89,55 @@ impl Config {
analysis.any_literal() && !analysis.any_uppercase()
}
/// Returns true if and only if this config is simple enough such that
/// if the pattern is a simple alternation of literals, then it can be
/// constructed via a plain Aho-Corasick automaton.
/// Returns whether the given patterns should be treated as "fixed strings"
/// literals. This is different from just querying the `fixed_strings` knob
/// in that if the knob is false, this will still return true in some cases
/// if the patterns are themselves indistinguishable from literals.
///
/// Note that it is OK to return true even when settings like `multi_line`
/// are enabled, since if multi-line can impact the match semantics of a
/// regex, then it is by definition not a simple alternation of literals.
pub fn can_plain_aho_corasick(&self) -> bool {
!self.word && !self.case_insensitive && !self.case_smart
/// The main idea here is that if this returns true, then it is safe
/// to build an `regex_syntax::hir::Hir` value directly from the given
/// patterns as an alternation of `hir::Literal` values.
fn is_fixed_strings<P: AsRef<str>>(&self, patterns: &[P]) -> bool {
// When these are enabled, we really need to parse the patterns and
// let them go through the standard HIR translation process in order
// for case folding transforms to be applied.
if self.case_insensitive || self.case_smart {
return false;
}
/// Perform analysis on the AST of this pattern.
///
/// This returns an error if the given pattern failed to parse.
fn analysis(&self, ast: &Ast) -> Result<AstAnalysis, Error> {
Ok(AstAnalysis::from_ast(ast))
// Even if whole_line or word is enabled, both of those things can
// be implemented by wrapping the Hir generated by an alternation of
// fixed string literals. So for here at least, we don't care about the
// word or whole_line settings.
if self.fixed_strings {
// ... but if any literal contains a line terminator, then we've
// got to bail out because this will ultimately result in an error.
if let Some(lineterm) = self.line_terminator {
for p in patterns.iter() {
if has_line_terminator(lineterm, p.as_ref()) {
return false;
}
/// Parse the given pattern into its abstract syntax.
///
/// This returns an error if the given pattern failed to parse.
fn ast(&self, pattern: &str) -> Result<Ast, Error> {
ast::parse::ParserBuilder::new()
.nest_limit(self.nest_limit)
.octal(self.octal)
.ignore_whitespace(self.ignore_whitespace)
.build()
.parse(pattern)
.map_err(Error::regex)
}
}
return true;
}
// In this case, the only way we can hand construct the Hir is if none
// of the patterns contain meta characters. If they do, then we need to
// send them through the standard parsing/translation process.
for p in patterns.iter() {
let p = p.as_ref();
if p.chars().any(regex_syntax::is_meta_character) {
return false;
}
// Same deal as when fixed_strings is set above. If the pattern has
// a line terminator anywhere, then we need to bail out and let
// an error occur.
if let Some(lineterm) = self.line_terminator {
if has_line_terminator(lineterm, p) {
return false;
}
}
}
true
}
}
@@ -149,170 +153,268 @@ impl Config {
/// size limits set on the configured HIR will be propagated out to any
/// subsequently constructed HIR or regular expression.
#[derive(Clone, Debug)]
pub struct ConfiguredHIR {
original: String,
pub(crate) struct ConfiguredHIR {
config: Config,
analysis: AstAnalysis,
expr: Hir,
hir: Hir,
}
impl ConfiguredHIR {
/// Return the configuration for this HIR expression.
pub fn config(&self) -> &Config {
/// Parse the given patterns into a single HIR expression that represents
/// an alternation of the patterns given.
fn new<P: AsRef<str>>(
config: Config,
patterns: &[P],
) -> Result<ConfiguredHIR, Error> {
let hir = if config.is_fixed_strings(patterns) {
let mut alts = vec![];
for p in patterns.iter() {
alts.push(Hir::literal(p.as_ref().as_bytes()));
}
log::debug!(
"assembling HIR from {} fixed string literals",
alts.len()
);
let hir = Hir::alternation(alts);
hir
} else {
let mut alts = vec![];
for p in patterns.iter() {
alts.push(if config.fixed_strings {
format!("(?:{})", regex_syntax::escape(p.as_ref()))
} else {
format!("(?:{})", p.as_ref())
});
}
let pattern = alts.join("|");
let ast = ast::parse::ParserBuilder::new()
.nest_limit(config.nest_limit)
.octal(config.octal)
.ignore_whitespace(config.ignore_whitespace)
.build()
.parse(&pattern)
.map_err(Error::generic)?;
let analysis = AstAnalysis::from_ast(&ast);
let mut hir = hir::translate::TranslatorBuilder::new()
.utf8(false)
.case_insensitive(config.is_case_insensitive(&analysis))
.multi_line(config.multi_line)
.dot_matches_new_line(config.dot_matches_new_line)
.crlf(config.crlf)
.swap_greed(config.swap_greed)
.unicode(config.unicode)
.build()
.translate(&pattern, &ast)
.map_err(Error::generic)?;
// We don't need to do this for the fixed-strings case above
// because is_fixed_strings will return false if any pattern
// contains a line terminator. Therefore, we don't need to strip
// it.
//
// We go to some pains to avoid doing this in the fixed-strings
// case because this can result in building a new HIR when ripgrep
// is given a huge set of literals to search for. And this can
// actually take a little time. It's not huge, but it's noticeable.
hir = match config.line_terminator {
None => hir,
Some(line_term) => strip_from_match(hir, line_term)?,
};
hir
};
Ok(ConfiguredHIR { config, hir })
}
/// Return a reference to the underlying configuration.
pub(crate) fn config(&self) -> &Config {
&self.config
}
/// Compute the set of non-matching bytes for this HIR expression.
pub fn non_matching_bytes(&self) -> ByteSet {
non_matching_bytes(&self.expr)
/// Return a reference to the underyling HIR.
pub(crate) fn hir(&self) -> &Hir {
&self.hir
}
/// Returns true if and only if this regex needs to have its match offsets
/// tweaked because of CRLF support. Specifically, this occurs when the
/// CRLF hack is enabled and the regex is line anchored at the end. In
/// this case, matches that end with a `\r` have the `\r` stripped.
pub fn needs_crlf_stripped(&self) -> bool {
self.config.crlf && self.expr.is_line_anchored_end()
/// Convert this HIR to a regex that can be used for matching.
pub(crate) fn to_regex(&self) -> Result<Regex, Error> {
let meta = Regex::config()
.utf8_empty(false)
.nfa_size_limit(Some(self.config.size_limit))
// We don't expose a knob for this because the one-pass DFA is
// usually not a perf bottleneck for ripgrep. But we give it some
// extra room than the default.
.onepass_size_limit(Some(10 * (1 << 20)))
// Same deal here. The default limit for full DFAs is VERY small,
// but with ripgrep we can afford to spend a bit more time on
// building them I think.
.dfa_size_limit(Some(1 * (1 << 20)))
.dfa_state_limit(Some(1_000))
.hybrid_cache_capacity(self.config.dfa_size_limit);
Regex::builder()
.configure(meta)
.build_from_hir(&self.hir)
.map_err(Error::regex)
}
/// Compute the set of non-matching bytes for this HIR expression.
pub(crate) fn non_matching_bytes(&self) -> ByteSet {
non_matching_bytes(&self.hir)
}
/// Returns the line terminator configured on this expression.
///
/// When we have beginning/end anchors (NOT line anchors), the fast line
/// searching path isn't quite correct. Or at least, doesn't match the
/// slow path. Namely, the slow path strips line terminators while the
/// fast path does not. Since '$' (when multi-line mode is disabled)
/// doesn't match at line boundaries, the existence of a line terminator
/// might cause it to not match when it otherwise would with the line
/// terminator stripped.
/// searching path isn't quite correct. Or at least, doesn't match the slow
/// path. Namely, the slow path strips line terminators while the fast path
/// does not. Since '$' (when multi-line mode is disabled) doesn't match at
/// line boundaries, the existence of a line terminator might cause it to
/// not match when it otherwise would with the line terminator stripped.
///
/// Since searching with text anchors is exceptionally rare in the
/// context of line oriented searching (multi-line mode is basically
/// always enabled), we just disable this optimization when there are
/// text anchors. We disable it by not returning a line terminator, since
/// Since searching with text anchors is exceptionally rare in the context
/// of line oriented searching (multi-line mode is basically always
/// enabled), we just disable this optimization when there are text
/// anchors. We disable it by not returning a line terminator, since
/// without a line terminator, the fast search path can't be executed.
///
/// Actually, the above is no longer quite correct. Later on, another
/// optimization was added where if the line terminator was in the set of
/// bytes that was guaranteed to never be part of a match, then the higher
/// level search infrastructure assumes that the fast line-by-line search
/// path can still be taken. This optimization applies when multi-line
/// search (not multi-line mode) is enabled. In that case, there is no
/// configured line terminator since the regex is permitted to match a
/// line terminator. But if the regex is guaranteed to never match across
/// multiple lines despite multi-line search being requested, we can still
/// do the faster and more flexible line-by-line search. This is why the
/// non-matching extraction routine removes `\n` when `\A` and `\z` are
/// present even though that's not quite correct...
///
/// See: <https://github.com/BurntSushi/ripgrep/issues/2260>
pub fn line_terminator(&self) -> Option<LineTerminator> {
if self.is_any_anchored() {
pub(crate) fn line_terminator(&self) -> Option<LineTerminator> {
if self.hir.properties().look_set().contains_anchor_haystack() {
None
} else {
self.config.line_terminator
}
}
/// Returns true if and only if the underlying HIR has any text anchors.
fn is_any_anchored(&self) -> bool {
self.expr.is_any_anchored_start() || self.expr.is_any_anchored_end()
}
/// Builds a regular expression from this HIR expression.
pub fn regex(&self) -> Result<Regex, Error> {
self.pattern_to_regex(&self.expr.to_string())
}
/// If this HIR corresponds to an alternation of literals with no
/// capturing groups, then this returns those literals.
pub fn alternation_literals(&self) -> Option<Vec<Vec<u8>>> {
if !self.config.can_plain_aho_corasick() {
return None;
}
alternation_literals(&self.expr)
}
/// Applies the given function to the concrete syntax of this HIR and then
/// generates a new HIR based on the result of the function in a way that
/// preserves the configuration.
/// Turns this configured HIR into one that only matches when both sides of
/// the match correspond to a word boundary.
///
/// For example, this can be used to wrap a user provided regular
/// expression with additional semantics. e.g., See the `WordMatcher`.
pub fn with_pattern<F: FnMut(&str) -> String>(
&self,
mut f: F,
) -> Result<ConfiguredHIR, Error> {
self.pattern_to_hir(&f(&self.expr.to_string()))
/// Note that the HIR returned is like turning `pat` into
/// `(?m:^|\W)(pat)(?m:$|\W)`. That is, the true match is at capture group
/// `1` and not `0`.
pub(crate) fn into_word(self) -> Result<ConfiguredHIR, Error> {
// In theory building the HIR for \W should never fail, but there are
// likely some pathological cases (particularly with respect to certain
// values of limits) where it could in theory fail.
let non_word = {
let mut config = self.config.clone();
config.fixed_strings = false;
ConfiguredHIR::new(config, &[r"\W"])?
};
let line_anchor_start = Hir::look(self.line_anchor_start());
let line_anchor_end = Hir::look(self.line_anchor_end());
let hir = Hir::concat(vec![
Hir::alternation(vec![line_anchor_start, non_word.hir.clone()]),
Hir::capture(hir::Capture {
index: 1,
name: None,
sub: Box::new(renumber_capture_indices(self.hir)?),
}),
Hir::alternation(vec![non_word.hir, line_anchor_end]),
]);
Ok(ConfiguredHIR { config: self.config, hir })
}
/// If the current configuration has a line terminator set and if useful
/// literals could be extracted, then a regular expression matching those
/// literals is returned. If no line terminator is set, then `None` is
/// returned.
///
/// If compiling the resulting regular expression failed, then an error
/// is returned.
///
/// This method only returns something when a line terminator is set
/// because matches from this regex are generally candidates that must be
/// confirmed before reporting a match. When performing a line oriented
/// search, confirmation is easy: just extend the candidate match to its
/// respective line boundaries and then re-search that line for a full
/// match. This only works when the line terminator is set because the line
/// terminator setting guarantees that the regex itself can never match
/// through the line terminator byte.
pub fn fast_line_regex(&self) -> Result<Option<Regex>, Error> {
if self.config.line_terminator.is_none() {
return Ok(None);
/// Turns this configured HIR into an equivalent one, but where it must
/// match at the start and end of a line.
pub(crate) fn into_whole_line(self) -> ConfiguredHIR {
let line_anchor_start = Hir::look(self.line_anchor_start());
let line_anchor_end = Hir::look(self.line_anchor_end());
let hir =
Hir::concat(vec![line_anchor_start, self.hir, line_anchor_end]);
ConfiguredHIR { config: self.config, hir }
}
match LiteralSets::new(&self.expr).one_regex(self.config.word) {
None => Ok(None),
Some(pattern) => self.pattern_to_regex(&pattern).map(Some),
/// Turns this configured HIR into an equivalent one, but where it must
/// match at the start and end of the haystack.
pub(crate) fn into_anchored(self) -> ConfiguredHIR {
let hir = Hir::concat(vec![
Hir::look(hir::Look::Start),
self.hir,
Hir::look(hir::Look::End),
]);
ConfiguredHIR { config: self.config, hir }
}
/// Returns the "start line" anchor for this configuration.
fn line_anchor_start(&self) -> hir::Look {
if self.config.crlf {
hir::Look::StartCRLF
} else {
hir::Look::StartLF
}
}
/// Create a regex from the given pattern using this HIR's configuration.
fn pattern_to_regex(&self, pattern: &str) -> Result<Regex, Error> {
// The settings we explicitly set here are intentionally a subset
// of the settings we have. The key point here is that our HIR
// expression is computed with the settings in mind, such that setting
// them here could actually lead to unintended behavior. For example,
// consider the pattern `(?U)a+`. This will get folded into the HIR
// as a non-greedy repetition operator which will in turn get printed
// to the concrete syntax as `a+?`, which is correct. But if we
// set the `swap_greed` option again, then we'll wind up with `(?U)a+?`
// which is equal to `a+` which is not the same as what we were given.
//
// We also don't need to apply `case_insensitive` since this gets
// folded into the HIR and would just cause us to do redundant work.
//
// Finally, we don't need to set `ignore_whitespace` since the concrete
// syntax emitted by the HIR printer never needs it.
//
// We set the rest of the options. Some of them are important, such as
// the size limit, and some of them are necessary to preserve the
// intention of the original pattern. For example, the Unicode flag
// will impact how the WordMatcher functions, namely, whether its
// word boundaries are Unicode aware or not.
RegexBuilder::new(&pattern)
.nest_limit(self.config.nest_limit)
.octal(self.config.octal)
.multi_line(self.config.multi_line)
.dot_matches_new_line(self.config.dot_matches_new_line)
.unicode(self.config.unicode)
.size_limit(self.config.size_limit)
.dfa_size_limit(self.config.dfa_size_limit)
.build()
.map_err(Error::regex)
/// Returns the "end line" anchor for this configuration.
fn line_anchor_end(&self) -> hir::Look {
if self.config.crlf {
hir::Look::EndCRLF
} else {
hir::Look::EndLF
}
/// Create an HIR expression from the given pattern using this HIR's
/// configuration.
fn pattern_to_hir(&self, pattern: &str) -> Result<ConfiguredHIR, Error> {
// See `pattern_to_regex` comment for explanation of why we only set
// a subset of knobs here. e.g., `swap_greed` is explicitly left out.
let expr = ::regex_syntax::ParserBuilder::new()
.nest_limit(self.config.nest_limit)
.octal(self.config.octal)
.allow_invalid_utf8(true)
.multi_line(self.config.multi_line)
.dot_matches_new_line(self.config.dot_matches_new_line)
.unicode(self.config.unicode)
.build()
.parse(pattern)
.map_err(Error::regex)?;
Ok(ConfiguredHIR {
original: self.original.clone(),
config: self.config.clone(),
analysis: self.analysis.clone(),
expr,
})
}
}
/// This increments the index of every capture group in the given hir by 1. If
/// any increment results in an overflow, then an error is returned.
fn renumber_capture_indices(hir: Hir) -> Result<Hir, Error> {
Ok(match hir.into_kind() {
HirKind::Empty => Hir::empty(),
HirKind::Literal(hir::Literal(lit)) => Hir::literal(lit),
HirKind::Class(cls) => Hir::class(cls),
HirKind::Look(x) => Hir::look(x),
HirKind::Repetition(mut x) => {
x.sub = Box::new(renumber_capture_indices(*x.sub)?);
Hir::repetition(x)
}
HirKind::Capture(mut cap) => {
cap.index = match cap.index.checked_add(1) {
Some(index) => index,
None => {
// This error message kind of sucks, but it's probably
// impossible for it to happen. The only way a capture
// index can overflow addition is if the regex is huge
// (or something else has gone horribly wrong).
let msg = "could not renumber capture index, too big";
return Err(Error::any(msg));
}
};
cap.sub = Box::new(renumber_capture_indices(*cap.sub)?);
Hir::capture(cap)
}
HirKind::Concat(subs) => {
let subs = subs
.into_iter()
.map(|sub| renumber_capture_indices(sub))
.collect::<Result<Vec<Hir>, Error>>()?;
Hir::concat(subs)
}
HirKind::Alternation(subs) => {
let subs = subs
.into_iter()
.map(|sub| renumber_capture_indices(sub))
.collect::<Result<Vec<Hir>, Error>>()?;
Hir::alternation(subs)
}
})
}
/// Returns true if the given literal string contains any byte from the line
/// terminator given.
fn has_line_terminator(lineterm: LineTerminator, literal: &str) -> bool {
if lineterm.is_crlf() {
literal.as_bytes().iter().copied().any(|b| b == b'\r' || b == b'\n')
} else {
literal.as_bytes().iter().copied().any(|b| b == lineterm.as_byte())
}
}

View File

@@ -1,189 +0,0 @@
use std::collections::HashMap;
use grep_matcher::{Match, Matcher, NoError};
use regex::bytes::Regex;
use regex_syntax::hir::{self, Hir, HirKind};
use crate::config::ConfiguredHIR;
use crate::error::Error;
use crate::matcher::RegexCaptures;
/// A matcher for implementing "word match" semantics.
#[derive(Clone, Debug)]
pub struct CRLFMatcher {
/// The regex.
regex: Regex,
/// A map from capture group name to capture group index.
names: HashMap<String, usize>,
}
impl CRLFMatcher {
/// Create a new matcher from the given pattern that strips `\r` from the
/// end of every match.
///
/// This panics if the given expression doesn't need its CRLF stripped.
pub fn new(expr: &ConfiguredHIR) -> Result<CRLFMatcher, Error> {
assert!(expr.needs_crlf_stripped());
let regex = expr.regex()?;
let mut names = HashMap::new();
for (i, optional_name) in regex.capture_names().enumerate() {
if let Some(name) = optional_name {
names.insert(name.to_string(), i.checked_sub(1).unwrap());
}
}
Ok(CRLFMatcher { regex, names })
}
/// Return the underlying regex used by this matcher.
pub fn regex(&self) -> &Regex {
&self.regex
}
}
impl Matcher for CRLFMatcher {
type Captures = RegexCaptures;
type Error = NoError;
fn find_at(
&self,
haystack: &[u8],
at: usize,
) -> Result<Option<Match>, NoError> {
let m = match self.regex.find_at(haystack, at) {
None => return Ok(None),
Some(m) => Match::new(m.start(), m.end()),
};
Ok(Some(adjust_match(haystack, m)))
}
fn new_captures(&self) -> Result<RegexCaptures, NoError> {
Ok(RegexCaptures::new(self.regex.capture_locations()))
}
fn capture_count(&self) -> usize {
self.regex.captures_len().checked_sub(1).unwrap()
}
fn capture_index(&self, name: &str) -> Option<usize> {
self.names.get(name).map(|i| *i)
}
fn captures_at(
&self,
haystack: &[u8],
at: usize,
caps: &mut RegexCaptures,
) -> Result<bool, NoError> {
caps.strip_crlf(false);
let r =
self.regex.captures_read_at(caps.locations_mut(), haystack, at);
if !r.is_some() {
return Ok(false);
}
// If the end of our match includes a `\r`, then strip it from all
// capture groups ending at the same location.
let end = caps.locations().get(0).unwrap().1;
if end > 0 && haystack.get(end - 1) == Some(&b'\r') {
caps.strip_crlf(true);
}
Ok(true)
}
// We specifically do not implement other methods like find_iter or
// captures_iter. Namely, the iter methods are guaranteed to be correct
// by virtue of implementing find_at and captures_at above.
}
/// If the given match ends with a `\r`, then return a new match that ends
/// immediately before the `\r`.
pub fn adjust_match(haystack: &[u8], m: Match) -> Match {
if m.end() > 0 && haystack.get(m.end() - 1) == Some(&b'\r') {
m.with_end(m.end() - 1)
} else {
m
}
}
/// Substitutes all occurrences of multi-line enabled `$` with `(?:\r?$)`.
///
/// This does not preserve the exact semantics of the given expression,
/// however, it does have the useful property that anything that matched the
/// given expression will also match the returned expression. The difference is
/// that the returned expression can match possibly other things as well.
///
/// The principle reason why we do this is because the underlying regex engine
/// doesn't support CRLF aware `$` look-around. It's planned to fix it at that
/// level, but we perform this kludge in the mean time.
///
/// Note that while the match preserving semantics are nice and neat, the
/// match position semantics are quite a bit messier. Namely, `$` only ever
/// matches the position between characters where as `\r??` can match a
/// character and change the offset. This is regretable, but works out pretty
/// nicely in most cases, especially when a match is limited to a single line.
pub fn crlfify(expr: Hir) -> Hir {
match expr.into_kind() {
HirKind::Anchor(hir::Anchor::EndLine) => {
let concat = Hir::concat(vec![
Hir::repetition(hir::Repetition {
kind: hir::RepetitionKind::ZeroOrOne,
greedy: false,
hir: Box::new(Hir::literal(hir::Literal::Unicode('\r'))),
}),
Hir::anchor(hir::Anchor::EndLine),
]);
Hir::group(hir::Group {
kind: hir::GroupKind::NonCapturing,
hir: Box::new(concat),
})
}
HirKind::Empty => Hir::empty(),
HirKind::Literal(x) => Hir::literal(x),
HirKind::Class(x) => Hir::class(x),
HirKind::Anchor(x) => Hir::anchor(x),
HirKind::WordBoundary(x) => Hir::word_boundary(x),
HirKind::Repetition(mut x) => {
x.hir = Box::new(crlfify(*x.hir));
Hir::repetition(x)
}
HirKind::Group(mut x) => {
x.hir = Box::new(crlfify(*x.hir));
Hir::group(x)
}
HirKind::Concat(xs) => {
Hir::concat(xs.into_iter().map(crlfify).collect())
}
HirKind::Alternation(xs) => {
Hir::alternation(xs.into_iter().map(crlfify).collect())
}
}
}
#[cfg(test)]
mod tests {
use super::crlfify;
use regex_syntax::Parser;
fn roundtrip(pattern: &str) -> String {
let expr1 = Parser::new().parse(pattern).unwrap();
let expr2 = crlfify(expr1);
expr2.to_string()
}
#[test]
fn various() {
assert_eq!(roundtrip(r"(?m)$"), "(?:\r??(?m:$))");
assert_eq!(roundtrip(r"(?m)$$"), "(?:\r??(?m:$))(?:\r??(?m:$))");
assert_eq!(
roundtrip(r"(?m)(?:foo$|bar$)"),
"(?:foo(?:\r??(?m:$))|bar(?:\r??(?m:$)))"
);
assert_eq!(roundtrip(r"(?m)$a"), "(?:\r??(?m:$))a");
// Not a multiline `$`, so no crlfifying occurs.
assert_eq!(roundtrip(r"$"), "\\z");
// It's a literal, derp.
assert_eq!(roundtrip(r"\$"), "\\$");
}
}

View File

@@ -1,8 +1,3 @@
use std::error;
use std::fmt;
use crate::util;
/// An error that can occur in this crate.
///
/// Generally, this error corresponds to problems building a regular
@@ -18,10 +13,27 @@ impl Error {
Error { kind }
}
pub(crate) fn regex<E: error::Error>(err: E) -> Error {
pub(crate) fn regex(err: regex_automata::meta::BuildError) -> Error {
if let Some(size_limit) = err.size_limit() {
let kind = ErrorKind::Regex(format!(
"compiled regex exceeds size limit of {size_limit}",
));
Error { kind }
} else if let Some(ref err) = err.syntax_error() {
Error::generic(err)
} else {
Error::generic(err)
}
}
pub(crate) fn generic<E: std::error::Error>(err: E) -> Error {
Error { kind: ErrorKind::Regex(err.to_string()) }
}
pub(crate) fn any<E: ToString>(msg: E) -> Error {
Error { kind: ErrorKind::Regex(msg.to_string()) }
}
/// Return the kind of this error.
pub fn kind(&self) -> &ErrorKind {
&self.kind
@@ -30,6 +42,7 @@ impl Error {
/// The kind of an error that can occur.
#[derive(Clone, Debug)]
#[non_exhaustive]
pub enum ErrorKind {
/// An error that occurred as a result of parsing a regular expression.
/// This can be a syntax error or an error that results from attempting to
@@ -51,38 +64,26 @@ pub enum ErrorKind {
///
/// The invalid byte is included in this error.
InvalidLineTerminator(u8),
/// Hints that destructuring should not be exhaustive.
///
/// This enum may grow additional variants, so this makes sure clients
/// don't count on exhaustive matching. (Otherwise, adding a new variant
/// could break existing code.)
#[doc(hidden)]
__Nonexhaustive,
}
impl error::Error for Error {
fn description(&self) -> &str {
match self.kind {
ErrorKind::Regex(_) => "regex error",
ErrorKind::NotAllowed(_) => "literal not allowed",
ErrorKind::InvalidLineTerminator(_) => "invalid line terminator",
ErrorKind::__Nonexhaustive => unreachable!(),
}
}
}
impl std::error::Error for Error {}
impl std::fmt::Display for Error {
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
use bstr::ByteSlice;
impl fmt::Display for Error {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self.kind {
ErrorKind::Regex(ref s) => write!(f, "{}", s),
ErrorKind::NotAllowed(ref lit) => {
write!(f, "the literal '{:?}' is not allowed in a regex", lit)
write!(f, "the literal {:?} is not allowed in a regex", lit)
}
ErrorKind::InvalidLineTerminator(byte) => {
let x = util::show_bytes(&[byte]);
write!(f, "line terminators must be ASCII, but '{}' is not", x)
write!(
f,
"line terminators must be ASCII, but {} is not",
[byte].as_bstr()
)
}
ErrorKind::__Nonexhaustive => unreachable!(),
}
}
}

View File

@@ -8,12 +8,9 @@ pub use crate::matcher::{RegexCaptures, RegexMatcher, RegexMatcherBuilder};
mod ast;
mod config;
mod crlf;
mod error;
mod literal;
mod matcher;
mod multi;
mod non_matching;
mod strip;
mod util;
mod word;

File diff suppressed because it is too large Load Diff

View File

@@ -1,15 +1,22 @@
use std::collections::HashMap;
use std::sync::Arc;
use grep_matcher::{
ByteSet, Captures, LineMatchKind, LineTerminator, Match, Matcher, NoError,
use {
grep_matcher::{
ByteSet, Captures, LineMatchKind, LineTerminator, Match, Matcher,
NoError,
},
regex_automata::{
meta::Regex, util::captures::Captures as AutomataCaptures, Input,
PatternID,
},
};
use regex::bytes::{CaptureLocations, Regex};
use crate::config::{Config, ConfiguredHIR};
use crate::crlf::CRLFMatcher;
use crate::error::Error;
use crate::multi::MultiLiteralMatcher;
use crate::word::WordMatcher;
use crate::{
config::{Config, ConfiguredHIR},
error::Error,
literal::InnerLiterals,
word::WordMatcher,
};
/// A builder for constructing a `Matcher` using regular expressions.
///
@@ -43,18 +50,37 @@ impl RegexMatcherBuilder {
/// The syntax supported is documented as part of the regex crate:
/// <https://docs.rs/regex/#syntax>.
pub fn build(&self, pattern: &str) -> Result<RegexMatcher, Error> {
let chir = self.config.hir(pattern)?;
let fast_line_regex = chir.fast_line_regex()?;
let non_matching_bytes = chir.non_matching_bytes();
if let Some(ref re) = fast_line_regex {
log::debug!("extracted fast line regex: {:?}", re);
self.build_many(&[pattern])
}
let matcher = RegexMatcherImpl::new(&chir)?;
log::trace!("final regex: {:?}", matcher.regex());
let mut config = self.config.clone();
// We override the line terminator in case the configured expr doesn't
/// Build a new matcher using the current configuration for the provided
/// patterns. The resulting matcher behaves as if all of the patterns
/// given are joined together into a single alternation. That is, it
/// reports matches where at least one of the given patterns matches.
pub fn build_many<P: AsRef<str>>(
&self,
patterns: &[P],
) -> Result<RegexMatcher, Error> {
let chir = self.config.build_many(patterns)?;
let matcher = RegexMatcherImpl::new(chir)?;
let (chir, re) = (matcher.chir(), matcher.regex());
log::trace!("final regex: {:?}", chir.hir().to_string());
let non_matching_bytes = chir.non_matching_bytes();
// If we can pick out some literals from the regex, then we might be
// able to build a faster regex that quickly identifies candidate
// matching lines. The regex engine will do what it can on its own, but
// we can specifically do a little more when a line terminator is set.
// For example, for a regex like `\w+foo\w+`, we can look for `foo`,
// and when a match is found, look for the line containing `foo` and
// then run the original regex on only that line. (In this case, the
// regex engine is likely to handle this case for us since it's so
// simple, but the idea applies.)
let fast_line_regex = InnerLiterals::new(chir, re).one_regex()?;
// We override the line terminator in case the configured HIR doesn't
// support it.
let mut config = self.config.clone();
config.line_terminator = chir.line_terminator();
Ok(RegexMatcher {
config,
@@ -73,39 +99,7 @@ impl RegexMatcherBuilder {
&self,
literals: &[B],
) -> Result<RegexMatcher, Error> {
let mut has_escape = false;
let mut slices = vec![];
for lit in literals {
slices.push(lit.as_ref());
has_escape = has_escape || lit.as_ref().contains('\\');
}
// Even when we have a fixed set of literals, we might still want to
// use the regex engine. Specifically, if any string has an escape
// in it, then we probably can't feed it to Aho-Corasick without
// removing the escape. Additionally, if there are any particular
// special match semantics we need to honor, that Aho-Corasick isn't
// enough. Finally, the regex engine can do really well with a small
// number of literals (at time of writing, this is changing soon), so
// we use it when there's a small set.
//
// Yes, this is one giant hack. Ideally, this entirely separate literal
// matcher that uses Aho-Corasick would be pushed down into the regex
// engine.
if has_escape
|| !self.config.can_plain_aho_corasick()
|| literals.len() < 40
{
return self.build(&slices.join("|"));
}
let matcher = MultiLiteralMatcher::new(&slices)?;
let imp = RegexMatcherImpl::MultiLiteral(matcher);
Ok(RegexMatcher {
config: self.config.clone(),
matcher: imp,
fast_line_regex: None,
non_matching_bytes: ByteSet::empty(),
})
self.build_many(literals)
}
/// Set the value for the case insensitive (`i`) flag.
@@ -306,20 +300,15 @@ impl RegexMatcherBuilder {
/// 1. It causes the line terminator for the matcher to be `\r\n`. Namely,
/// this prevents the matcher from ever producing a match that contains
/// a `\r` or `\n`.
/// 2. It translates all instances of `$` in the pattern to `(?:\r??$)`.
/// This works around the fact that the regex engine does not support
/// matching CRLF as a line terminator when using `$`.
/// 2. It enables CRLF mode for `^` and `$`. This means that line anchors
/// will treat both `\r` and `\n` as line terminators, but will never
/// match between a `\r` and `\n`.
///
/// In particular, because of (2), the matches produced by the matcher may
/// be slightly different than what one would expect given the pattern.
/// This is the trade off made: in many cases, `$` will "just work" in the
/// presence of `\r\n` line terminators, but matches may require some
/// trimming to faithfully represent the intended match.
///
/// Note that if you do not wish to set the line terminator but would still
/// like `$` to match `\r\n` line terminators, then it is valid to call
/// `crlf(true)` followed by `line_terminator(None)`. Ordering is
/// important, since `crlf` and `line_terminator` override each other.
/// Note that if you do not wish to set the line terminator but would
/// still like `$` to match `\r\n` line terminators, then it is valid to
/// call `crlf(true)` followed by `line_terminator(None)`. Ordering is
/// important, since `crlf` sets the line terminator, but `line_terminator`
/// does not touch the `crlf` setting.
pub fn crlf(&mut self, yes: bool) -> &mut RegexMatcherBuilder {
if yes {
self.config.line_terminator = Some(LineTerminator::crlf());
@@ -345,6 +334,21 @@ impl RegexMatcherBuilder {
self.config.word = yes;
self
}
/// Whether the patterns should be treated as literal strings or not. When
/// this is active, all characters, including ones that would normally be
/// special regex meta characters, are matched literally.
pub fn fixed_strings(&mut self, yes: bool) -> &mut RegexMatcherBuilder {
self.config.fixed_strings = yes;
self
}
/// Whether each pattern should match the entire line or not. This is
/// equivalent to surrounding the pattern with `(?m:^)` and `(?m:$)`.
pub fn whole_line(&mut self, yes: bool) -> &mut RegexMatcherBuilder {
self.config.whole_line = yes;
self
}
}
/// An implementation of the `Matcher` trait using Rust's standard regex
@@ -374,10 +378,10 @@ impl RegexMatcher {
/// Create a new matcher from the given pattern using the default
/// configuration, but matches lines terminated by `\n`.
///
/// This is meant to be a convenience constructor for using a
/// `RegexMatcherBuilder` and setting its
/// [`line_terminator`](struct.RegexMatcherBuilder.html#method.line_terminator)
/// to `\n`. The purpose of using this constructor is to permit special
/// This is meant to be a convenience constructor for
/// using a `RegexMatcherBuilder` and setting its
/// [`line_terminator`](RegexMatcherBuilder::method.line_terminator) to
/// `\n`. The purpose of using this constructor is to permit special
/// optimizations that help speed up line oriented search. These types of
/// optimizations are only appropriate when matches span no more than one
/// line. For this reason, this constructor will return an error if the
@@ -393,13 +397,6 @@ impl RegexMatcher {
enum RegexMatcherImpl {
/// The standard matcher used for all regular expressions.
Standard(StandardMatcher),
/// A matcher for an alternation of plain literals.
MultiLiteral(MultiLiteralMatcher),
/// A matcher that strips `\r` from the end of matches.
///
/// This is only used when the CRLF hack is enabled and the regex is line
/// anchored at the end.
CRLF(CRLFMatcher),
/// A matcher that only matches at word boundaries. This transforms the
/// regex to `(^|\W)(...)($|\W)` instead of the more intuitive `\b(...)\b`.
/// Because of this, the WordMatcher provides its own implementation of
@@ -411,29 +408,33 @@ enum RegexMatcherImpl {
impl RegexMatcherImpl {
/// Based on the configuration, create a new implementation of the
/// `Matcher` trait.
fn new(expr: &ConfiguredHIR) -> Result<RegexMatcherImpl, Error> {
if expr.config().word {
Ok(RegexMatcherImpl::Word(WordMatcher::new(expr)?))
} else if expr.needs_crlf_stripped() {
Ok(RegexMatcherImpl::CRLF(CRLFMatcher::new(expr)?))
fn new(mut chir: ConfiguredHIR) -> Result<RegexMatcherImpl, Error> {
// When whole_line is set, we don't use a word matcher even if word
// matching was requested. Why? Because `(?m:^)(pat)(?m:$)` implies
// word matching.
Ok(if chir.config().word && !chir.config().whole_line {
RegexMatcherImpl::Word(WordMatcher::new(chir)?)
} else {
if let Some(lits) = expr.alternation_literals() {
if lits.len() >= 40 {
let matcher = MultiLiteralMatcher::new(&lits)?;
return Ok(RegexMatcherImpl::MultiLiteral(matcher));
}
}
Ok(RegexMatcherImpl::Standard(StandardMatcher::new(expr)?))
if chir.config().whole_line {
chir = chir.into_whole_line();
}
RegexMatcherImpl::Standard(StandardMatcher::new(chir)?)
})
}
/// Return the underlying regex object used.
fn regex(&self) -> String {
fn regex(&self) -> &Regex {
match *self {
RegexMatcherImpl::Word(ref x) => x.regex().to_string(),
RegexMatcherImpl::CRLF(ref x) => x.regex().to_string(),
RegexMatcherImpl::MultiLiteral(_) => "<N/A>".to_string(),
RegexMatcherImpl::Standard(ref x) => x.regex.to_string(),
RegexMatcherImpl::Word(ref x) => x.regex(),
RegexMatcherImpl::Standard(ref x) => &x.regex,
}
}
/// Return the underlying HIR of the regex used for searching.
fn chir(&self) -> &ConfiguredHIR {
match *self {
RegexMatcherImpl::Word(ref x) => x.chir(),
RegexMatcherImpl::Standard(ref x) => &x.chir,
}
}
}
@@ -453,8 +454,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.find_at(haystack, at),
MultiLiteral(ref m) => m.find_at(haystack, at),
CRLF(ref m) => m.find_at(haystack, at),
Word(ref m) => m.find_at(haystack, at),
}
}
@@ -463,8 +462,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.new_captures(),
MultiLiteral(ref m) => m.new_captures(),
CRLF(ref m) => m.new_captures(),
Word(ref m) => m.new_captures(),
}
}
@@ -473,8 +470,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.capture_count(),
MultiLiteral(ref m) => m.capture_count(),
CRLF(ref m) => m.capture_count(),
Word(ref m) => m.capture_count(),
}
}
@@ -483,8 +478,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.capture_index(name),
MultiLiteral(ref m) => m.capture_index(name),
CRLF(ref m) => m.capture_index(name),
Word(ref m) => m.capture_index(name),
}
}
@@ -493,8 +486,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.find(haystack),
MultiLiteral(ref m) => m.find(haystack),
CRLF(ref m) => m.find(haystack),
Word(ref m) => m.find(haystack),
}
}
@@ -506,8 +497,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.find_iter(haystack, matched),
MultiLiteral(ref m) => m.find_iter(haystack, matched),
CRLF(ref m) => m.find_iter(haystack, matched),
Word(ref m) => m.find_iter(haystack, matched),
}
}
@@ -523,8 +512,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.try_find_iter(haystack, matched),
MultiLiteral(ref m) => m.try_find_iter(haystack, matched),
CRLF(ref m) => m.try_find_iter(haystack, matched),
Word(ref m) => m.try_find_iter(haystack, matched),
}
}
@@ -537,8 +524,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.captures(haystack, caps),
MultiLiteral(ref m) => m.captures(haystack, caps),
CRLF(ref m) => m.captures(haystack, caps),
Word(ref m) => m.captures(haystack, caps),
}
}
@@ -555,8 +540,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.captures_iter(haystack, caps, matched),
MultiLiteral(ref m) => m.captures_iter(haystack, caps, matched),
CRLF(ref m) => m.captures_iter(haystack, caps, matched),
Word(ref m) => m.captures_iter(haystack, caps, matched),
}
}
@@ -573,10 +556,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.try_captures_iter(haystack, caps, matched),
MultiLiteral(ref m) => {
m.try_captures_iter(haystack, caps, matched)
}
CRLF(ref m) => m.try_captures_iter(haystack, caps, matched),
Word(ref m) => m.try_captures_iter(haystack, caps, matched),
}
}
@@ -590,8 +569,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.captures_at(haystack, at, caps),
MultiLiteral(ref m) => m.captures_at(haystack, at, caps),
CRLF(ref m) => m.captures_at(haystack, at, caps),
Word(ref m) => m.captures_at(haystack, at, caps),
}
}
@@ -608,8 +585,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.replace(haystack, dst, append),
MultiLiteral(ref m) => m.replace(haystack, dst, append),
CRLF(ref m) => m.replace(haystack, dst, append),
Word(ref m) => m.replace(haystack, dst, append),
}
}
@@ -629,12 +604,6 @@ impl Matcher for RegexMatcher {
Standard(ref m) => {
m.replace_with_captures(haystack, caps, dst, append)
}
MultiLiteral(ref m) => {
m.replace_with_captures(haystack, caps, dst, append)
}
CRLF(ref m) => {
m.replace_with_captures(haystack, caps, dst, append)
}
Word(ref m) => {
m.replace_with_captures(haystack, caps, dst, append)
}
@@ -645,8 +614,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.is_match(haystack),
MultiLiteral(ref m) => m.is_match(haystack),
CRLF(ref m) => m.is_match(haystack),
Word(ref m) => m.is_match(haystack),
}
}
@@ -659,8 +626,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.is_match_at(haystack, at),
MultiLiteral(ref m) => m.is_match_at(haystack, at),
CRLF(ref m) => m.is_match_at(haystack, at),
Word(ref m) => m.is_match_at(haystack, at),
}
}
@@ -672,8 +637,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.shortest_match(haystack),
MultiLiteral(ref m) => m.shortest_match(haystack),
CRLF(ref m) => m.shortest_match(haystack),
Word(ref m) => m.shortest_match(haystack),
}
}
@@ -686,8 +649,6 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.shortest_match_at(haystack, at),
MultiLiteral(ref m) => m.shortest_match_at(haystack, at),
CRLF(ref m) => m.shortest_match_at(haystack, at),
Word(ref m) => m.shortest_match_at(haystack, at),
}
}
@@ -706,7 +667,10 @@ impl Matcher for RegexMatcher {
) -> Result<Option<LineMatchKind>, NoError> {
Ok(match self.fast_line_regex {
Some(ref regex) => {
regex.shortest_match(haystack).map(LineMatchKind::Candidate)
let input = Input::new(haystack);
regex
.search_half(&input)
.map(|hm| LineMatchKind::Candidate(hm.offset()))
}
None => {
self.shortest_match(haystack)?.map(LineMatchKind::Confirmed)
@@ -721,20 +685,19 @@ struct StandardMatcher {
/// The regular expression compiled from the pattern provided by the
/// caller.
regex: Regex,
/// A map from capture group name to its corresponding index.
names: HashMap<String, usize>,
/// The HIR that produced this regex.
///
/// We put this in an `Arc` because by the time it gets here, it won't
/// change. And because cloning and dropping an `Hir` is somewhat expensive
/// due to its deep recursive representation.
chir: Arc<ConfiguredHIR>,
}
impl StandardMatcher {
fn new(expr: &ConfiguredHIR) -> Result<StandardMatcher, Error> {
let regex = expr.regex()?;
let mut names = HashMap::new();
for (i, optional_name) in regex.capture_names().enumerate() {
if let Some(name) = optional_name {
names.insert(name.to_string(), i);
}
}
Ok(StandardMatcher { regex, names })
fn new(chir: ConfiguredHIR) -> Result<StandardMatcher, Error> {
let chir = Arc::new(chir);
let regex = chir.to_regex()?;
Ok(StandardMatcher { regex, chir })
}
}
@@ -747,14 +710,12 @@ impl Matcher for StandardMatcher {
haystack: &[u8],
at: usize,
) -> Result<Option<Match>, NoError> {
Ok(self
.regex
.find_at(haystack, at)
.map(|m| Match::new(m.start(), m.end())))
let input = Input::new(haystack).span(at..haystack.len());
Ok(self.regex.find(input).map(|m| Match::new(m.start(), m.end())))
}
fn new_captures(&self) -> Result<RegexCaptures, NoError> {
Ok(RegexCaptures::new(self.regex.capture_locations()))
Ok(RegexCaptures::new(self.regex.create_captures()))
}
fn capture_count(&self) -> usize {
@@ -762,7 +723,7 @@ impl Matcher for StandardMatcher {
}
fn capture_index(&self, name: &str) -> Option<usize> {
self.names.get(name).map(|i| *i)
self.regex.group_info().to_index(PatternID::ZERO, name)
}
fn try_find_iter<F, E>(
@@ -789,10 +750,10 @@ impl Matcher for StandardMatcher {
at: usize,
caps: &mut RegexCaptures,
) -> Result<bool, NoError> {
Ok(self
.regex
.captures_read_at(&mut caps.locations_mut(), haystack, at)
.is_some())
let input = Input::new(haystack).span(at..haystack.len());
let caps = caps.captures_mut();
self.regex.search_captures(&input, caps);
Ok(caps.is_match())
}
fn shortest_match_at(
@@ -800,7 +761,8 @@ impl Matcher for StandardMatcher {
haystack: &[u8],
at: usize,
) -> Result<Option<usize>, NoError> {
Ok(self.regex.shortest_match_at(haystack, at))
let input = Input::new(haystack).span(at..haystack.len());
Ok(self.regex.search_half(&input).map(|hm| hm.offset()))
}
}
@@ -819,17 +781,9 @@ impl Matcher for StandardMatcher {
/// index of the group using the corresponding matcher's `capture_index`
/// method, and then use that index with `RegexCaptures::get`.
#[derive(Clone, Debug)]
pub struct RegexCaptures(RegexCapturesImp);
#[derive(Clone, Debug)]
enum RegexCapturesImp {
AhoCorasick {
/// The start and end of the match, corresponding to capture group 0.
mat: Option<Match>,
},
Regex {
/// Where the locations are stored.
locs: CaptureLocations,
pub struct RegexCaptures {
/// Where the captures are stored.
caps: AutomataCaptures,
/// These captures behave as if the capturing groups begin at the given
/// offset. When set to `0`, this has no affect and capture groups are
/// indexed like normal.
@@ -841,115 +795,37 @@ enum RegexCapturesImp {
/// the matcher and the capturing groups must behave as if `(re)` is
/// the `0`th capture group.
offset: usize,
/// When enable, the end of a match has `\r` stripped from it, if one
/// exists.
strip_crlf: bool,
},
}
impl Captures for RegexCaptures {
fn len(&self) -> usize {
match self.0 {
RegexCapturesImp::AhoCorasick { .. } => 1,
RegexCapturesImp::Regex { ref locs, offset, .. } => {
locs.len().checked_sub(offset).unwrap()
}
}
self.caps
.group_info()
.all_group_len()
.checked_sub(self.offset)
.unwrap()
}
fn get(&self, i: usize) -> Option<Match> {
match self.0 {
RegexCapturesImp::AhoCorasick { mat, .. } => {
if i == 0 {
mat
} else {
None
}
}
RegexCapturesImp::Regex { ref locs, offset, strip_crlf } => {
if !strip_crlf {
let actual = i.checked_add(offset).unwrap();
return locs.pos(actual).map(|(s, e)| Match::new(s, e));
}
// currently don't support capture offsetting with CRLF
// stripping
assert_eq!(offset, 0);
let m = match locs.pos(i).map(|(s, e)| Match::new(s, e)) {
None => return None,
Some(m) => m,
};
// If the end position of this match corresponds to the end
// position of the overall match, then we apply our CRLF
// stripping. Otherwise, we cannot assume stripping is correct.
if i == 0 || m.end() == locs.pos(0).unwrap().1 {
Some(m.with_end(m.end() - 1))
} else {
Some(m)
}
}
}
let actual = i.checked_add(self.offset).unwrap();
self.caps.get_group(actual).map(|sp| Match::new(sp.start, sp.end))
}
}
impl RegexCaptures {
pub(crate) fn simple() -> RegexCaptures {
RegexCaptures(RegexCapturesImp::AhoCorasick { mat: None })
}
pub(crate) fn new(locs: CaptureLocations) -> RegexCaptures {
RegexCaptures::with_offset(locs, 0)
pub(crate) fn new(caps: AutomataCaptures) -> RegexCaptures {
RegexCaptures::with_offset(caps, 0)
}
pub(crate) fn with_offset(
locs: CaptureLocations,
caps: AutomataCaptures,
offset: usize,
) -> RegexCaptures {
RegexCaptures(RegexCapturesImp::Regex {
locs,
offset,
strip_crlf: false,
})
RegexCaptures { caps, offset }
}
pub(crate) fn locations(&self) -> &CaptureLocations {
match self.0 {
RegexCapturesImp::AhoCorasick { .. } => {
panic!("getting locations for simple captures is invalid")
}
RegexCapturesImp::Regex { ref locs, .. } => locs,
}
}
pub(crate) fn locations_mut(&mut self) -> &mut CaptureLocations {
match self.0 {
RegexCapturesImp::AhoCorasick { .. } => {
panic!("getting locations for simple captures is invalid")
}
RegexCapturesImp::Regex { ref mut locs, .. } => locs,
}
}
pub(crate) fn strip_crlf(&mut self, yes: bool) {
match self.0 {
RegexCapturesImp::AhoCorasick { .. } => {
panic!("setting strip_crlf for simple captures is invalid")
}
RegexCapturesImp::Regex { ref mut strip_crlf, .. } => {
*strip_crlf = yes;
}
}
}
pub(crate) fn set_simple(&mut self, one: Option<Match>) {
match self.0 {
RegexCapturesImp::AhoCorasick { ref mut mat } => {
*mat = one;
}
RegexCapturesImp::Regex { .. } => {
panic!("setting simple captures for regex is invalid")
}
}
pub(crate) fn captures_mut(&mut self) -> &mut AutomataCaptures {
&mut self.caps
}
}
@@ -1036,7 +912,9 @@ mod tests {
}
// Test that finding candidate lines works as expected.
// FIXME: Re-enable this test once inner literal extraction works.
#[test]
#[ignore]
fn candidate_lines() {
fn is_confirmed(m: LineMatchKind) -> bool {
match m {

View File

@@ -1,6 +1,6 @@
use aho_corasick::{AhoCorasick, AhoCorasickBuilder, MatchKind};
use aho_corasick::{AhoCorasick, MatchKind};
use grep_matcher::{Match, Matcher, NoError};
use regex_syntax::hir::Hir;
use regex_syntax::hir::{Hir, HirKind};
use crate::error::Error;
use crate::matcher::RegexCaptures;
@@ -23,11 +23,10 @@ impl MultiLiteralMatcher {
pub fn new<B: AsRef<[u8]>>(
literals: &[B],
) -> Result<MultiLiteralMatcher, Error> {
let ac = AhoCorasickBuilder::new()
let ac = AhoCorasick::builder()
.match_kind(MatchKind::LeftmostFirst)
.auto_configure(literals)
.build_with_size::<usize, _, _>(literals)
.map_err(Error::regex)?;
.build(literals)
.map_err(Error::generic)?;
Ok(MultiLiteralMatcher { ac })
}
}
@@ -79,13 +78,11 @@ impl Matcher for MultiLiteralMatcher {
/// Alternation literals checks if the given HIR is a simple alternation of
/// literals, and if so, returns them. Otherwise, this returns None.
pub fn alternation_literals(expr: &Hir) -> Option<Vec<Vec<u8>>> {
use regex_syntax::hir::{HirKind, Literal};
// This is pretty hacky, but basically, if `is_alternation_literal` is
// true, then we can make several assumptions about the structure of our
// HIR. This is what justifies the `unreachable!` statements below.
if !expr.is_alternation_literal() {
if !expr.properties().is_alternation_literal() {
return None;
}
let alts = match *expr.kind() {
@@ -93,26 +90,16 @@ pub fn alternation_literals(expr: &Hir) -> Option<Vec<Vec<u8>>> {
_ => return None, // one literal isn't worth it
};
let extendlit = |lit: &Literal, dst: &mut Vec<u8>| match *lit {
Literal::Unicode(c) => {
let mut buf = [0; 4];
dst.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
}
Literal::Byte(b) => {
dst.push(b);
}
};
let mut lits = vec![];
for alt in alts {
let mut lit = vec![];
match *alt.kind() {
HirKind::Empty => {}
HirKind::Literal(ref x) => extendlit(x, &mut lit),
HirKind::Literal(ref x) => lit.extend_from_slice(&x.0),
HirKind::Concat(ref exprs) => {
for e in exprs {
match *e.kind() {
HirKind::Literal(ref x) => extendlit(x, &mut lit),
HirKind::Literal(ref x) => lit.extend_from_slice(&x.0),
_ => unreachable!("expected literal, got {:?}", e),
}
}

View File

@@ -1,9 +1,13 @@
use grep_matcher::ByteSet;
use regex_syntax::hir::{self, Hir, HirKind};
use regex_syntax::utf8::Utf8Sequences;
use {
grep_matcher::ByteSet,
regex_syntax::{
hir::{self, Hir, HirKind, Look},
utf8::Utf8Sequences,
},
};
/// Return a confirmed set of non-matching bytes from the given expression.
pub fn non_matching_bytes(expr: &Hir) -> ByteSet {
pub(crate) fn non_matching_bytes(expr: &Hir) -> ByteSet {
let mut set = ByteSet::full();
remove_matching_bytes(expr, &mut set);
set
@@ -13,18 +17,27 @@ pub fn non_matching_bytes(expr: &Hir) -> ByteSet {
/// the given expression.
fn remove_matching_bytes(expr: &Hir, set: &mut ByteSet) {
match *expr.kind() {
HirKind::Empty | HirKind::WordBoundary(_) => {}
HirKind::Anchor(_) => {
HirKind::Empty
| HirKind::Look(Look::WordAscii | Look::WordAsciiNegate)
| HirKind::Look(Look::WordUnicode | Look::WordUnicodeNegate) => {}
HirKind::Look(Look::Start | Look::End) => {
// FIXME: This is wrong, but not doing this leads to incorrect
// results because of how anchored searches are implemented in
// the 'grep-searcher' crate.
set.remove(b'\n');
}
HirKind::Literal(hir::Literal::Unicode(c)) => {
for &b in c.encode_utf8(&mut [0; 4]).as_bytes() {
HirKind::Look(Look::StartLF | Look::EndLF) => {
set.remove(b'\n');
}
HirKind::Look(Look::StartCRLF | Look::EndCRLF) => {
set.remove(b'\r');
set.remove(b'\n');
}
HirKind::Literal(hir::Literal(ref lit)) => {
for &b in lit.iter() {
set.remove(b);
}
}
HirKind::Literal(hir::Literal::Byte(b)) => {
set.remove(b);
}
HirKind::Class(hir::Class::Unicode(ref cls)) => {
for range in cls.iter() {
// This is presumably faster than encoding every codepoint
@@ -42,10 +55,10 @@ fn remove_matching_bytes(expr: &Hir, set: &mut ByteSet) {
}
}
HirKind::Repetition(ref x) => {
remove_matching_bytes(&x.hir, set);
remove_matching_bytes(&x.sub, set);
}
HirKind::Group(ref x) => {
remove_matching_bytes(&x.hir, set);
HirKind::Capture(ref x) => {
remove_matching_bytes(&x.sub, set);
}
HirKind::Concat(ref xs) => {
for x in xs {
@@ -62,17 +75,13 @@ fn remove_matching_bytes(expr: &Hir, set: &mut ByteSet) {
#[cfg(test)]
mod tests {
use grep_matcher::ByteSet;
use regex_syntax::ParserBuilder;
use {grep_matcher::ByteSet, regex_syntax::ParserBuilder};
use super::non_matching_bytes;
fn extract(pattern: &str) -> ByteSet {
let expr = ParserBuilder::new()
.allow_invalid_utf8(true)
.build()
.parse(pattern)
.unwrap();
let expr =
ParserBuilder::new().utf8(false).build().parse(pattern).unwrap();
non_matching_bytes(&expr)
}
@@ -131,9 +140,13 @@ mod tests {
#[test]
fn anchor() {
// FIXME: The first four tests below should correspond to a full set
// of bytes for the non-matching bytes I think.
assert_eq!(sparse(&extract(r"^")), sparse_except(&[b'\n']));
assert_eq!(sparse(&extract(r"$")), sparse_except(&[b'\n']));
assert_eq!(sparse(&extract(r"\A")), sparse_except(&[b'\n']));
assert_eq!(sparse(&extract(r"\z")), sparse_except(&[b'\n']));
assert_eq!(sparse(&extract(r"(?m)^")), sparse_except(&[b'\n']));
assert_eq!(sparse(&extract(r"(?m)$")), sparse_except(&[b'\n']));
}
}

View File

@@ -1,5 +1,7 @@
use grep_matcher::LineTerminator;
use regex_syntax::hir::{self, Hir, HirKind};
use {
grep_matcher::LineTerminator,
regex_syntax::hir::{self, Hir, HirKind},
};
use crate::error::{Error, ErrorKind};
@@ -15,7 +17,26 @@ use crate::error::{Error, ErrorKind};
///
/// If the given line terminator is not ASCII, then this function returns an
/// error.
pub fn strip_from_match(
///
/// Note that as of regex 1.9, this routine could theoretically be implemented
/// without returning an error. Namely, for example, we could turn
/// `foo\nbar` into `foo[a&&b]bar`. That is, replace line terminators with a
/// sub-expression that can never match anything. Thus, ripgrep would accept
/// such regexes and just silently not match anything. Regex versions prior to 1.8
/// don't support such constructs. I ended up deciding to leave the existing
/// behavior of returning an error instead. For example:
///
/// ```text
/// $ echo -n 'foo\nbar\n' | rg 'foo\nbar'
/// the literal '"\n"' is not allowed in a regex
///
/// Consider enabling multiline mode with the --multiline flag (or -U for short).
/// When multiline mode is enabled, new line characters can be matched.
/// ```
///
/// This looks like a good error message to me, and even suggests a flag that
/// the user can use instead.
pub(crate) fn strip_from_match(
expr: Hir,
line_term: LineTerminator,
) -> Result<Hir, Error> {
@@ -23,40 +44,34 @@ pub fn strip_from_match(
let expr1 = strip_from_match_ascii(expr, b'\r')?;
strip_from_match_ascii(expr1, b'\n')
} else {
let b = line_term.as_byte();
if b > 0x7F {
return Err(Error::new(ErrorKind::InvalidLineTerminator(b)));
}
strip_from_match_ascii(expr, b)
strip_from_match_ascii(expr, line_term.as_byte())
}
}
/// The implementation of strip_from_match. The given byte must be ASCII. This
/// function panics otherwise.
/// The implementation of strip_from_match. The given byte must be ASCII.
/// This function returns an error otherwise. It also returns an error if
/// it couldn't remove `\n` from the given regex without leaving an empty
/// character class in its place.
fn strip_from_match_ascii(expr: Hir, byte: u8) -> Result<Hir, Error> {
assert!(byte <= 0x7F);
let chr = byte as char;
assert_eq!(chr.len_utf8(), 1);
let invalid = || Err(Error::new(ErrorKind::NotAllowed(chr.to_string())));
if !byte.is_ascii() {
return Err(Error::new(ErrorKind::InvalidLineTerminator(byte)));
}
let ch = char::from(byte);
let invalid = || Err(Error::new(ErrorKind::NotAllowed(ch.to_string())));
Ok(match expr.into_kind() {
HirKind::Empty => Hir::empty(),
HirKind::Literal(hir::Literal::Unicode(c)) => {
if c == chr {
HirKind::Literal(hir::Literal(lit)) => {
if lit.iter().find(|&&b| b == byte).is_some() {
return invalid();
}
Hir::literal(hir::Literal::Unicode(c))
}
HirKind::Literal(hir::Literal::Byte(b)) => {
if b as char == chr {
return invalid();
}
Hir::literal(hir::Literal::Byte(b))
Hir::literal(lit)
}
HirKind::Class(hir::Class::Unicode(mut cls)) => {
if cls.ranges().is_empty() {
return Ok(Hir::class(hir::Class::Unicode(cls)));
}
let remove = hir::ClassUnicode::new(Some(
hir::ClassUnicodeRange::new(chr, chr),
hir::ClassUnicodeRange::new(ch, ch),
));
cls.difference(&remove);
if cls.ranges().is_empty() {
@@ -65,6 +80,9 @@ fn strip_from_match_ascii(expr: Hir, byte: u8) -> Result<Hir, Error> {
Hir::class(hir::Class::Unicode(cls))
}
HirKind::Class(hir::Class::Bytes(mut cls)) => {
if cls.ranges().is_empty() {
return Ok(Hir::class(hir::Class::Bytes(cls)));
}
let remove = hir::ClassBytes::new(Some(
hir::ClassBytesRange::new(byte, byte),
));
@@ -74,15 +92,14 @@ fn strip_from_match_ascii(expr: Hir, byte: u8) -> Result<Hir, Error> {
}
Hir::class(hir::Class::Bytes(cls))
}
HirKind::Anchor(x) => Hir::anchor(x),
HirKind::WordBoundary(x) => Hir::word_boundary(x),
HirKind::Look(x) => Hir::look(x),
HirKind::Repetition(mut x) => {
x.hir = Box::new(strip_from_match_ascii(*x.hir, byte)?);
x.sub = Box::new(strip_from_match_ascii(*x.sub, byte)?);
Hir::repetition(x)
}
HirKind::Group(mut x) => {
x.hir = Box::new(strip_from_match_ascii(*x.hir, byte)?);
Hir::group(x)
HirKind::Capture(mut x) => {
x.sub = Box::new(strip_from_match_ascii(*x.sub, byte)?);
Hir::capture(x)
}
HirKind::Concat(xs) => {
let xs = xs
@@ -131,11 +148,11 @@ mod tests {
#[test]
fn various() {
assert_eq!(roundtrip(r"[a\n]", b'\n'), "[a]");
assert_eq!(roundtrip(r"[a\n]", b'a'), "[\n]");
assert_eq!(roundtrip_crlf(r"[a\n]"), "[a]");
assert_eq!(roundtrip_crlf(r"[a\r]"), "[a]");
assert_eq!(roundtrip_crlf(r"[a\r\n]"), "[a]");
assert_eq!(roundtrip(r"[a\n]", b'\n'), "a");
assert_eq!(roundtrip(r"[a\n]", b'a'), "\n");
assert_eq!(roundtrip_crlf(r"[a\n]"), "a");
assert_eq!(roundtrip_crlf(r"[a\r]"), "a");
assert_eq!(roundtrip_crlf(r"[a\r\n]"), "a");
assert_eq!(roundtrip(r"(?-u)\s", b'a'), r"(?-u:[\x09-\x0D\x20])");
assert_eq!(roundtrip(r"(?-u)\s", b'\n'), r"(?-u:[\x09\x0B-\x0D\x20])");

View File

@@ -1,29 +0,0 @@
/// Converts an arbitrary sequence of bytes to a literal suitable for building
/// a regular expression.
pub fn bytes_to_regex(bs: &[u8]) -> String {
use regex_syntax::is_meta_character;
use std::fmt::Write;
let mut s = String::with_capacity(bs.len());
for &b in bs {
if b <= 0x7F && !is_meta_character(b as char) {
write!(s, r"{}", b as char).unwrap();
} else {
write!(s, r"\x{:02x}", b).unwrap();
}
}
s
}
/// Converts arbitrary bytes to a nice string.
pub fn show_bytes(bs: &[u8]) -> String {
use std::ascii::escape_default;
use std::str;
let mut nice = String::new();
for &b in bs {
let part: Vec<u8> = escape_default(b).collect();
nice.push_str(str::from_utf8(&part).unwrap());
}
nice
}

View File

@@ -1,39 +1,59 @@
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Arc;
use std::{
collections::HashMap,
panic::{RefUnwindSafe, UnwindSafe},
sync::Arc,
};
use grep_matcher::{Match, Matcher, NoError};
use regex::bytes::{CaptureLocations, Regex};
use thread_local::ThreadLocal;
use {
grep_matcher::{Match, Matcher, NoError},
regex_automata::{
meta::Regex, util::captures::Captures, util::pool::Pool, Input,
PatternID,
},
};
use crate::config::ConfiguredHIR;
use crate::error::Error;
use crate::matcher::RegexCaptures;
use crate::{config::ConfiguredHIR, error::Error, matcher::RegexCaptures};
type PoolFn =
Box<dyn Fn() -> Captures + Send + Sync + UnwindSafe + RefUnwindSafe>;
/// A matcher for implementing "word match" semantics.
#[derive(Debug)]
pub struct WordMatcher {
pub(crate) struct WordMatcher {
/// The regex which is roughly `(?:^|\W)(<original pattern>)(?:$|\W)`.
regex: Regex,
/// The HIR that produced the regex above. We don't keep the HIR for the
/// `original` regex.
///
/// We put this in an `Arc` because by the time it gets here, it won't
/// change. And because cloning and dropping an `Hir` is somewhat expensive
/// due to its deep recursive representation.
chir: Arc<ConfiguredHIR>,
/// The original regex supplied by the user, which we use in a fast path
/// to try and detect matches before deferring to slower engines.
original: Regex,
/// A map from capture group name to capture group index.
names: HashMap<String, usize>,
/// A reusable buffer for finding the match location of the inner group.
locs: Arc<ThreadLocal<RefCell<CaptureLocations>>>,
/// A thread-safe pool of reusable buffers for finding the match offset of
/// the inner group.
caps: Arc<Pool<Captures, PoolFn>>,
}
impl Clone for WordMatcher {
fn clone(&self) -> WordMatcher {
// We implement Clone manually so that we get a fresh ThreadLocal such
// that it can set its own thread owner. This permits each thread
// usings `locs` to hit the fast path.
// We implement Clone manually so that we get a fresh Pool such that it
// can set its own thread owner. This permits each thread usings `caps`
// to hit the fast path.
//
// Note that cloning a regex is "cheap" since it uses reference
// counting internally.
let re = self.regex.clone();
WordMatcher {
regex: self.regex.clone(),
chir: Arc::clone(&self.chir),
original: self.original.clone(),
names: self.names.clone(),
locs: Arc::new(ThreadLocal::new()),
caps: Arc::new(Pool::new(Box::new(move || re.create_captures()))),
}
}
}
@@ -44,31 +64,38 @@ impl WordMatcher {
///
/// The given options are used to construct the regular expression
/// internally.
pub fn new(expr: &ConfiguredHIR) -> Result<WordMatcher, Error> {
let original =
expr.with_pattern(|pat| format!("^(?:{})$", pat))?.regex()?;
let word_expr = expr.with_pattern(|pat| {
let pat = format!(r"(?:(?m:^)|\W)({})(?:\W|(?m:$))", pat);
log::debug!("word regex: {:?}", pat);
pat
})?;
let regex = word_expr.regex()?;
let locs = Arc::new(ThreadLocal::new());
pub(crate) fn new(chir: ConfiguredHIR) -> Result<WordMatcher, Error> {
let original = chir.clone().into_anchored().to_regex()?;
let chir = Arc::new(chir.into_word()?);
let regex = chir.to_regex()?;
let caps = Arc::new(Pool::new({
let regex = regex.clone();
Box::new(move || regex.create_captures()) as PoolFn
}));
let mut names = HashMap::new();
for (i, optional_name) in regex.capture_names().enumerate() {
let it = regex.group_info().pattern_names(PatternID::ZERO);
for (i, optional_name) in it.enumerate() {
if let Some(name) = optional_name {
names.insert(name.to_string(), i.checked_sub(1).unwrap());
}
}
Ok(WordMatcher { regex, original, names, locs })
Ok(WordMatcher { regex, chir, original, names, caps })
}
/// Return the underlying regex used by this matcher.
pub fn regex(&self) -> &Regex {
/// Return the underlying regex used to match at word boundaries.
///
/// The original regex is in the capture group at index 1.
pub(crate) fn regex(&self) -> &Regex {
&self.regex
}
/// Return the underlying HIR for the regex used to match at word
/// boundaries.
pub(crate) fn chir(&self) -> &ConfiguredHIR {
&self.chir
}
/// Attempt to do a fast confirmation of a word match that covers a subset
/// (but hopefully a big subset) of most cases. Ok(Some(..)) is returned
/// when a match is found. Ok(None) is returned when there is definitively
@@ -79,12 +106,11 @@ impl WordMatcher {
haystack: &[u8],
at: usize,
) -> Result<Option<Match>, ()> {
// This is a bit hairy. The whole point here is to avoid running an
// NFA simulation in the regex engine. Remember, our word regex looks
// like this:
// This is a bit hairy. The whole point here is to avoid running a
// slower regex engine to extract capture groups. Remember, our word
// regex looks like this:
//
// (^|\W)(<original regex>)($|\W)
// where ^ and $ have multiline mode DISABLED
// (^|\W)(<original regex>)(\W|$)
//
// What we want are the match offsets of <original regex>. So in the
// easy/common case, the original regex will be sandwiched between
@@ -102,7 +128,8 @@ impl WordMatcher {
// The reason why we cannot handle the ^/$ cases here is because we
// can't assume anything about the original pattern. (Try commenting
// out the checks for ^/$ below and run the tests to see examples.)
let mut cand = match self.regex.find_at(haystack, at) {
let input = Input::new(haystack).span(at..haystack.len());
let mut cand = match self.regex.find(input) {
None => return Ok(None),
Some(m) => Match::new(m.start(), m.end()),
};
@@ -145,23 +172,23 @@ impl Matcher for WordMatcher {
//
// OK, well, it turns out that it is worth it! But it is quite tricky.
// See `fast_find` for details. Effectively, this lets us skip running
// the NFA simulation in the regex engine in the vast majority of
// cases. However, the NFA simulation is required for full correctness.
// a slower regex engine to extract capture groups in the vast majority
// of cases. However, the slower engine is I believe required for full
// correctness.
match self.fast_find(haystack, at) {
Ok(Some(m)) => return Ok(Some(m)),
Ok(None) => return Ok(None),
Err(()) => {}
}
let cell =
self.locs.get_or(|| RefCell::new(self.regex.capture_locations()));
let mut caps = cell.borrow_mut();
self.regex.captures_read_at(&mut caps, haystack, at);
Ok(caps.get(1).map(|m| Match::new(m.0, m.1)))
let input = Input::new(haystack).span(at..haystack.len());
let mut caps = self.caps.get();
self.regex.search_captures(&input, &mut caps);
Ok(caps.get_group(1).map(|sp| Match::new(sp.start, sp.end)))
}
fn new_captures(&self) -> Result<RegexCaptures, NoError> {
Ok(RegexCaptures::with_offset(self.regex.capture_locations(), 1))
Ok(RegexCaptures::with_offset(self.regex.create_captures(), 1))
}
fn capture_count(&self) -> usize {
@@ -178,9 +205,10 @@ impl Matcher for WordMatcher {
at: usize,
caps: &mut RegexCaptures,
) -> Result<bool, NoError> {
let r =
self.regex.captures_read_at(caps.locations_mut(), haystack, at);
Ok(r.is_some())
let input = Input::new(haystack).span(at..haystack.len());
let caps = caps.captures_mut();
self.regex.search_captures(&input, caps);
Ok(caps.is_match())
}
// We specifically do not implement other methods like find_iter or
@@ -195,8 +223,8 @@ mod tests {
use grep_matcher::{Captures, Match, Matcher};
fn matcher(pattern: &str) -> WordMatcher {
let chir = Config::default().hir(pattern).unwrap();
WordMatcher::new(&chir).unwrap()
let chir = Config::default().build_many(&[pattern]).unwrap();
WordMatcher::new(chir).unwrap()
}
fn find(pattern: &str, haystack: &str) -> Option<(usize, usize)> {

View File

@@ -1,6 +1,6 @@
[package]
name = "grep-searcher"
version = "0.1.10" #:version
version = "0.1.11" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
Fast line oriented regex searching as a library.
@@ -14,7 +14,7 @@ license = "Unlicense OR MIT"
edition = "2018"
[dependencies]
bstr = { version = "1.1.0", default-features = false, features = ["std"] }
bstr = { version = "1.6.0", default-features = false, features = ["std"] }
bytecount = "0.6"
encoding_rs = "0.8.14"
encoding_rs_io = "0.1.6"

View File

@@ -71,16 +71,6 @@ impl MmapChoice {
if !self.is_enabled() {
return None;
}
if !cfg!(target_pointer_width = "64") {
// For 32-bit systems, it looks like mmap will succeed even if it
// can't address the entire file. This seems to happen at least on
// Windows, even though it uses to work prior to ripgrep 13. The
// only Windows-related change in ripgrep 13, AFAIK, was statically
// linking vcruntime. So maybe that's related? But I'm not sure.
//
// See: https://github.com/BurntSushi/ripgrep/issues/1911
return None;
}
if cfg!(target_os = "macos") {
// I guess memory maps on macOS aren't great. Should re-evaluate.
return None;