Compare commits

...

230 Commits

Author SHA1 Message Date
Andrew Gallant
785c1f1766 release: globset, grep-cli, grep-printer, grep-searcher 2019-06-26 16:53:30 -04:00
Andrew Gallant
8b734cb490 deps: update everything 2019-06-26 16:51:06 -04:00
Andrew Gallant
b93762ea7a bstr: update everything to bstr 0.2 2019-06-26 16:47:33 -04:00
Andrew Gallant
34677d2622 search: a few small touchups 2019-06-18 20:23:47 -04:00
Andrew Gallant
d1389db2e3 search: better errors for preprocessor commands
If a preprocessor command could not be started, we now show some
additional context with the error message. Previously, it showed
something like this:

  some/file: No such file or directory (os error 2)

Which is itself pretty misleading. Now it shows:

  some/file: preprocessor command could not start: '"nonexist" "some/file"': No such file or directory (os error 2)

Fixes #1302
2019-06-16 19:02:02 -04:00
Andrew Gallant
50bcb7409e deps: update everything 2019-06-16 18:38:45 -04:00
Andrew Gallant
7b9972c308 style: fix deprecations
Use `dyn` for trait objects and use `..=` for inclusive ranges.
2019-06-16 18:37:51 -04:00
Hitesh Jasani
9f000c2910 ignore/types: add more nim types
PR #1297
2019-06-12 14:02:28 -04:00
skierpage
392682d352 doc: point regex doc link to the latest version
The latest doc is different, e.g. adds "symmetric differences" under
https://docs.rs/regex/*/regex/#character-classes

PR #1287
2019-06-01 08:44:55 -04:00
Andrew Gallant
7d3f794588 ignore: remove .git check in some cases
When we know we aren't going to process gitignores, we shouldn't waste
the syscall in every directory to check for a git repo.
2019-05-29 18:06:11 -04:00
bruce-one
290fd2a7b6 readme: mention Zstandard and Brotli
Also alphabetise the list.

PR #1288
2019-05-29 13:37:31 -04:00
Fabian Würfl
d1e4d28f30 readme: remove outdated statement
Issue #10 already states that "ripgrep is now in most or all of the major
package repositories."

PR #1280
2019-05-14 18:44:50 -04:00
Andrew Gallant
5ce2d7351d ci: use cross for musl x86_64 builds
This is necessary because jemalloc + musl + Ubuntu 16.04 is apparently
broken.

Moreover, jemalloc doesn't support i686, so we accept the performance
regression there.

See also: https://github.com/gnzlbg/jemallocator/issues/124
2019-04-25 11:12:14 -04:00
Andrew Gallant
9dcfd9a205 deps: bump pcre2-sys to 0.2.1
This brings in a bug fix that no longer tries to run `git` to update the
submodule if the `git` command doesn't exist.

This is useful is more restricted build contexts where `git` isn't
installed. Such as in the docker image used for running `cross`.
2019-04-25 11:12:14 -04:00
Andrew Gallant
36b276c6d0 printer: remove unnecessary mut 2019-04-24 17:22:27 -04:00
Andrew Gallant
03bf37ff4a alloc: use jemalloc when building with musl
It turns out that musl's allocator is slow enough to cause a fairly
noticeable performance regression when ripgrep is built as a static
binary with musl. We fix this by using jemalloc when building with musl.

We continue to use the default system allocator in all other scenarios.
Namely, glibc's allocator doesn't noticeably regress performance compared
to jemalloc. But we could add more targets to this logic if other
system allocators (macOS, Windows) prove to be slow.

This wasn't necessary before because rustc recently stopped using jemalloc
by default.

Fixes #1268
2019-04-24 17:21:38 -04:00
Andrew Gallant
e7829c05d3 cli: fix bug where last byte was stripped
In an effort to strip line terminators, we assumed their existence. But
a pattern file may not end with a line terminator, so we shouldn't
unconditionally strip them.

We fix this by moving to bstr's line handling, which does this for us
automatically.
2019-04-19 07:11:44 -04:00
Rory O’Kane
a6222939f9 readme: mention --pcre2 as long form of -P
This is for consistency with the short and long flags given in other
bullet points. I originally assumed there was no long flag for `-P`
because none was given here.

PR #1254
2019-04-16 21:22:48 -04:00
Rory O’Kane
6ffd434232 readme: mention --auto-hybrid-regex in advantages
This feature solves a major reason I was skeptical of using ripgrep, so
I think it’s good to mention it in the section about why one should use
it.

I use backreferences a lot, so I had previously thought that ripgrep
would provide no speed advantage over ag, since I would always have
`-P` enabled. But when I saw `--auto-hybrid-regex` in the 11.0.0
changelog, I learned that ripgrep can use it to speed up simple queries
while still allowing me to write backreferences.

PR #1253
2019-04-16 17:21:40 -04:00
Andrew Gallant
1f1cd9b467 pkg: update brew tap to 11.0.1 2019-04-16 13:39:56 -04:00
Andrew Gallant
973de50c9e ripgrep: release 11.0.1, take 2 2019-04-16 13:11:28 -04:00
Andrew Gallant
5f8805a496 ripgrep: release 11.0.1 2019-04-16 13:10:29 -04:00
Andrew Gallant
fdde2bcd38 deps: update regex to 1.1.6
This brings in a fix for a regression introduced in ripgrep 11.

Fixes #1247
2019-04-16 08:34:30 -04:00
Gerard de Melo
7b3fe6b325 doc: fix typo in FAQ
PR #1248
2019-04-16 08:32:30 -04:00
Max Horn
b3dd3ae203 ignore/types: add GAP
Add support for file types used by the GAP language, a research system
computational discrete algebra, see <https://www.gap-system.org>

PR #1249
2019-04-16 08:31:58 -04:00
Andrew Gallant
f3083e4574 readme: remove brew tap instructions
The brew tap isn't really needed any more, since SIMD is now
automatically enabled in all binaries.
2019-04-15 18:32:33 -04:00
Andrew Gallant
d03e30707e pkg: update brew tap to 11.0.0 2019-04-15 18:32:10 -04:00
Andrew Gallant
d7f57d9aab ripgrep: release 11.0.0 2019-04-15 18:09:40 -04:00
Andrew Gallant
1a2a24ea74 grep: release 0.2.4 2019-04-15 18:03:46 -04:00
Andrew Gallant
d66610b295 grep-cli: release 0.1.2 2019-04-15 18:02:44 -04:00
Andrew Gallant
019ae1989b grep-printer: release 0.1.2 2019-04-15 18:00:49 -04:00
Andrew Gallant
36d3f235dc grep-searcher: release 0.1.4 2019-04-15 17:59:22 -04:00
Andrew Gallant
79018eb693 grep-pcre2: release 0.1.3 2019-04-15 17:57:03 -04:00
Andrew Gallant
44cd344438 grep-regex: release 0.1.3 2019-04-15 17:56:04 -04:00
Andrew Gallant
e493e54b9b grep-matcher: release 0.1.2 2019-04-15 17:53:29 -04:00
Andrew Gallant
8e8215aa65 ignore: release 0.4.7 2019-04-15 17:50:37 -04:00
Andrew Gallant
3fe701498e doc: add note about --pre-glob
There was a performance warning in the --pre docs, but didn't mention
--pre-glob as a possible mitigation to it.
2019-04-15 17:47:48 -04:00
Andrew Gallant
e79085e9e4 release: globset 0.4.3 2019-04-15 14:07:03 -04:00
Andrew Gallant
764c197022 complete: fix typo 2019-04-15 07:04:57 -04:00
Andrew Gallant
ef1611b5f5 ripgrep: max-column-preview --> max-columns-preview
Credit to @okdana for catching this. This naming is a bit more
consistent with the existing --max-columns flag.
2019-04-15 06:51:51 -04:00
Andrew Gallant
45d12abbc5 changelog: small fixups 2019-04-14 20:21:55 -04:00
Andrew Gallant
5fde8391f9 changelog: backfill it
I went through every commit since the 0.10.0 release and added anything
that I thought was missing.
2019-04-14 20:04:01 -04:00
Marco Herrn
3edb11c513 ignore/types: add additional java files
- .jspx for XHTML JSP files
- .properties for Java Properties files (resource bundles, etc.)

Closes #1242
2019-04-14 19:38:24 -04:00
Andrew Gallant
ed144be775 ci: bump MSRV to 1.34.0 2019-04-14 19:29:27 -04:00
Andrew Gallant
967e7ad0de ripgrep: add --auto-hybrid-regex flag
This flag, when set, will automatically dispatch to PCRE2 if the given
regex cannot be compiled by Rust's regex engine. If both engines fail to
compile the regex, then both errors are surfaced.

Closes #1155
2019-04-14 19:29:27 -04:00
Andrew Gallant
9952ba2068 deps: update glob dev-dependency 2019-04-14 19:29:27 -04:00
Andrew Gallant
b751758d60 deps: update everything 2019-04-14 19:29:27 -04:00
Andrew Gallant
8f14cb18a5 ripgrep: increase pcre2's default JIT stack size
The default stack size is 32KB, and this increases it to 10MB. 32KB is
pretty paltry in the environments in which ripgrep runs, and 10MB is
easily afforded as a maximum size. (The size limit we set for Rust's
regex engine is considerably larger.)

This was motivated due to the fack that JIT stack limits have been
observed to be hit in the wild:
https://github.com/Microsoft/vscode/issues/64606
2019-04-14 19:29:27 -04:00
Andrew Gallant
da9d720431 ripgrep: add --pcre2-version flag
This flag will output details about the version of PCRE2 that ripgrep
is using (if any).
2019-04-14 19:29:27 -04:00
Andrew Gallant
a9d71a0368 pcre2: add a few re-exports
This adds the top-level is_jit_available and version free functions from
the underlying pcre2 crate, and also forwards the max_jit_stack_size
option.
2019-04-14 19:29:27 -04:00
Andrew Gallant
f3646242cc deps: use pcre2 0.2.0
This comes with PCRE 10.32 and a few new options we'll use in subsequent
commits.
2019-04-14 19:29:27 -04:00
Andrew Gallant
601f212a0b ripgrep: add -I as a short option for --no-filename
This flag is commonly used in pipelines and it can be annoying to write
it out every time you need it.

Ideally, we would use -h for this to match GNU grep, but -h is used to
print help output.

Closes #1185
2019-04-14 19:29:27 -04:00
Andrew Gallant
5a565354f8 versioning: next version will be ripgrep 11
This sets up the release announcement and briefly describes the
versioning change. The actual version change itself won't happen until
the release.

Closes #1172
2019-04-14 19:29:27 -04:00
Andrew Gallant
2a6532ae71 doc: note cases of exorbitant memory usage
Fixes #1189
2019-04-14 19:29:27 -04:00
Andrew Gallant
ece1f50cfe printer: support previews for long lines
This commit adds support for showing a preview of long lines. While the
default still remains as completely suppressing the entire line, this
new functionality will show the first N graphemes of a matching line,
including the number of matches that are suppressed.

This was unfortunately a fairly invasive change to the printer that
required a bit of refactoring. On the bright side, the single line
and multi-line coloring are now more unified than they were before.

Closes #1078
2019-04-14 19:29:27 -04:00
Andrew Gallant
a7d26c8f14 binary: rejigger ripgrep's handling of binary files
This commit attempts to surface binary filtering in a slightly more
user friendly way. Namely, before, ripgrep would silently stop
searching a file if it detected a NUL byte, even if it had previously
printed a match. This can lead to the user quite reasonably assuming
that there are no more matches, since a partial search is fairly
unintuitive. (ripgrep has this behavior by default because it really
wants to NOT search binary files at all, just like it doesn't search
gitignored or hidden files.)

With this commit, if a match has already been printed and ripgrep detects
a NUL byte, then it will print a warning message indicating that the search
stopped prematurely.

Moreover, this commit adds a new flag, --binary, which causes ripgrep to
stop filtering binary files, but in a way that still avoids dumping
binary data into terminals. That is, the --binary flag makes ripgrep
behave more like grep's default behavior.

For files explicitly specified in a search, e.g., `rg foo some-file`,
then no binary filtering is applied (just like no gitignore and no
hidden file filtering is applied). Instead, ripgrep behaves as if you
gave the --binary flag for all explicitly given files.

This was a fairly invasive change, and potentially increases the UX
complexity of ripgrep around binary files. (Before, there were two
binary modes, where as now there are three.) However, ripgrep is now a
bit louder with warning messages when binary file detection might
otherwise be hiding potential matches, so hopefully this is a net
improvement.

Finally, the `-uuu` convenience now maps to `--no-ignore --hidden
--binary`, since this is closer to the actualy intent of the
`--unrestricted` flag, i.e., to reduce ripgrep's smart filtering. As a
consequence, `rg -uuu foo` should now search roughly the same number of
bytes as `grep -r foo`, and `rg -uuua foo` should search roughly the
same number of bytes as `grep -ra foo`. (The "roughly" weasel word is
used because grep's and ripgrep's binary file detection might differ
somewhat---perhaps based on buffer sizes---which can impact exactly what
is and isn't searched.)

See the numerous tests in tests/binary.rs for intended behavior.

Fixes #306, Fixes #855
2019-04-14 19:29:27 -04:00
Andrew Gallant
bd222ae93f regex: fix HIR analysis bug
An alternate can be empty at this point, so we must handle it. We didn't
before because the regex engine actually disallows empty alternates,
however, this code runs before the regex compiler rejects the regex.
2019-04-14 19:29:27 -04:00
hupfdule
4359d8aac0 ignore/types: add more extensions for xml
This includes:

    *.dtd for Document Type Definitions
    *.xsl and *.xslt for XSL Transformation descriptions
    *.xsd for XML Schema definitions
    *.xjb for JAXB bindings
    *.rng for Relax NG files
    *.sch for Schematron files

PR #1243
2019-04-09 15:17:57 -04:00
tonypai
308819fb1f ignore/types: add lock files
Treat anything with a `.lock` extension as a lock file, with
an extra rule or two for special cases, e.g., package-lock.json.
2019-04-09 10:24:48 -04:00
Andrew Gallant
09108b7fda regex: make multi-literal searcher faster
This makes the case of searching for a dictionary of a very large number
of literals much much faster. (~10x or so.) In particular, we achieve this
by short-circuiting the construction of a full regex when we know we have
a simple alternation of literals. Building the regex for a large dictionary
(>100,000 literals) turns out to be quite slow, even if it internally will
dispatch to Aho-Corasick.

Even that isn't quite enough. It turns out that even *parsing* such a regex
is quite slow. So when the -F/--fixed-strings flag is set, we short
circuit regex parsing completely and jump straight to Aho-Corasick.

We aren't quite as fast as GNU grep here, but it's much closer (less than
2x slower).

In general, this is somewhat of a hack. In particular, it seems plausible
that this optimization could be implemented entirely in the regex engine.
Unfortunately, the regex engine's internals are just not amenable to this
at all, so it would require a larger refactoring effort. For now, it's
good enough to add this fairly simple hack at a higher level.

Unfortunately, if you don't pass -F/--fixed-strings, then ripgrep will
be slower, because of the aforementioned missing optimization. Moreover,
passing flags like `-i` or `-S` will cause ripgrep to abandon this
optimization and fall back to something potentially much slower. Again,
this fix really needs to happen inside the regex engine, although we
might be able to special case -i when the input literals are pure ASCII
via Aho-Corasick's `ascii_case_insensitive`.

Fixes #497, Fixes #838
2019-04-07 19:11:03 -04:00
Andrew Gallant
743d64f2e4 deps: update to clap 2.33 2019-04-06 10:35:08 -04:00
lesnyrumcajs
5962abc465 searcher: add option to disable BOM sniffing
This commit adds a new encoding feature where the -E/--encoding flag
will now accept a value of 'none'. When given this value, all encoding
related machinery is disabled and ripgrep will search the raw bytes of
the file, including the BOM if it's present.

Closes #1207, Closes #1208
2019-04-06 10:35:08 -04:00
dana
1604a18db3 ignore/types: add *.am and *.in for C/C++/make
PR #1205
2019-04-06 08:02:04 -04:00
luzpaz
9eeb0b01ce readme: add Repology badge
This adds a badge to the README.md file indicating to users that click
on it if their os/distro carries that latest version of ripgrep.

PR #1213
2019-04-06 08:00:40 -04:00
dana
df4400209a ripgrep: remove extra new-line after Clap output
PR #1222
2019-04-06 07:59:36 -04:00
Andrew Gallant
77439f99a4 deps: add bstr to Cargo.lock 2019-04-05 23:24:08 -04:00
Andrew Gallant
be7d6dd9ce regex: print out final regex in trace mode
This is useful for debugging to see what regex is actually being run.
We put this as a trace since the regex can be quite gnarly. (It is not
pretty printed.)
2019-04-05 23:24:08 -04:00
Andrew Gallant
9f15e3b671 regex: fix a perf bug when using -w flag
When looking for an inner literal to speed up searches, if only a prefix
is found, then we generally give up doing inner literal optimizations since
the regex engine will generally handle it for us. Unfortunately, this
decision was being made *before* we wrap the regex in (^|\W)...($|\W) when
using the -w/--word-regexp flag, which would then defeat the literal
optimizations inside the regex engine.

We fix this with a bit of a hack that says, "if we're doing a word regexp,
then give me back any literal you find, even if it's a prefix."
2019-04-05 23:24:08 -04:00
Andrew Gallant
254b8b67bb globset: small perf improvements
This tweaks the path handling functions slightly to make them a hair
faster. In particular, `file_name` is called on every path that ripgrep
visits, and it was possible to remove a few branches without changing
behavior.
2019-04-05 23:24:08 -04:00
Andrew Gallant
8a7f43b84d globset: use bstr
This simplifies the various path related functions and pushed more platform
dependent code down into bstr. This likely also makes things a bit more
efficient on Windows, since we now only do a single UTF-8 check for each
file path.
2019-04-05 23:24:08 -04:00
Andrew Gallant
d968a27ed5 cli: use bstr
This uses bstr in the unescaping logic. This lets us remove some platform
specific code, and also lets us remove a hacked UTF-8 decoder on raw
bytes.
2019-04-05 23:24:08 -04:00
Andrew Gallant
9b8f5cbaba config: switch to using bstrs
This lets us implement correct Unicode trimming and also simplifies the
parsing logic a bit. This also removes the last platform specific bits of
code in ripgrep core.
2019-04-05 23:24:08 -04:00
Andrew Gallant
c52da74ac3 printer: use bstr
This starts the usage of bstr in the printer. We don't use it too much
yet, but it comes in handy for implementing PrinterPath and lets us push
down some platform specific code into bstr.
2019-04-05 23:24:08 -04:00
Andrew Gallant
7dcbff9a9b searcher: partially migrate to bstr
This commit causes grep-searcher to use byte strings internally for its
line buffer support. We manage to remove a use of `unsafe` by doing this
(by pushing it down into `bstr`).

We stop short of using byte strings everywhere else because we rely
heavily on the `impl ops::Index<[u8]> for grep_matcher::Match` impl,
which isn't available for byte strings. (It is premature to make bstr a
public dep of a core crate like grep-matcher, but maybe some day.)
2019-04-05 23:24:08 -04:00
Andrew Gallant
bef1f0e770 ci: switch to xenial (#1234)
Rust is having problems with trusty, in particular, see this bug I
filed: https://github.com/rust-lang/rust/issues/59411

This was purpotedly fixed in
https://github.com/rust-lang/rust/pull/59468,
but it appears the issue is still occurring.

This commit tries to update to Ubuntu 16.04 in the hope that it will fix
this problem.
2019-04-03 19:52:34 -04:00
Andrew Gallant
cd9815cb37 deps: update to aho-corasick 0.7
We do the simplest possible change to migrate to the new version.

Fixes #1228
2019-04-03 13:51:26 -04:00
Andrew Gallant
3f22c3a658 deps: update everything
This updates all dependencies to their latest versions.

We tolerate a duplicative aho-corasick for now, which we will fix in the
next commit.
2019-04-03 13:07:26 -04:00
Andrew Gallant
0913972104 deps: bump encoding_rs_io
This brings in a new API for disabling BOM sniffing.

This is part of the work toward completing
https://github.com/BurntSushi/ripgrep/issues/1207
2019-03-03 16:36:34 -05:00
Andrew Gallant
f19b84fb23 regex: bump regex dep to fix match bug
See

* 661bf53d5b
* edf45e6f5f

for details on the bug fix, which was in the regex engine.

Fixes #1203
2019-02-27 17:42:14 -05:00
Andrew Gallant
59fc583aeb readme: include details about filtering
Despite the fact that we mention this in several places, people are
still surprised by ripgrep's "smart" filtering.
2019-02-27 08:01:23 -05:00
Andrew Gallant
1c7c4e6640 deps: update tempfile 2019-02-21 16:32:17 -05:00
Andrew Gallant
69c5e3938d deps: bump smallvec
This gets rid of the unmaintained crates `unreachable` and `void`. Yay!
2019-02-21 16:31:48 -05:00
Andrew Gallant
d9cf05ad50 deps: update to aho-corasick 0.6.10
This brings in a fix for this bug:
https://github.com/BurntSushi/aho-corasick/issues/37

Fixes #1079
2019-02-16 11:39:33 -05:00
Andrew Gallant
af8b6caebb deps: update various dependencies 2019-02-16 09:39:42 -05:00
Andrew Gallant
c84cfb6756 grep-regex-0.1.2 2019-02-16 09:30:06 -05:00
Andrew Gallant
895e26a000 ci: don't do releases on all tags
This attempts to make Appveyor more conservative in what tags it thinks
are releases. I don't know for sure, but it looks like the previous
regex could match anywhere, so we anchor it.

Fixes #1195
2019-02-10 12:51:56 -05:00
Andrew Gallant
8c95290ff6 deps: miscellaneous updates 2019-02-10 07:45:08 -05:00
Andrew Gallant
d6feeb7ff2 grep-searcher-0.1.3 2019-02-10 07:42:37 -05:00
Andrew Gallant
626ed00c19 searcher: revert big-endian patch
This undoes the patch to stop using bytecount on big-endian
architectures. In particular, we bump our bytecount dependency to the
latest release, which has a fix.

This reverts commit a4868b8835.

Fixes #1144 (again), Closes #1194
2019-02-10 07:40:32 -05:00
Andrew Gallant
332ad18401 tests: use const constructor for atomics
We did this in 05411b2b for core ripgrep, but didn't carry it over to
tests.
2019-02-09 16:27:25 -05:00
Andrew Gallant
fc3cf41247 grep-searcher-0.1.2 2019-02-09 16:13:07 -05:00
Andrew Gallant
a4868b8835 searcher: use naive line counting on big-endian
This patches out bytecount's "fast" vectorized algorithm on big-endian
machines, where it has been observed to fail. Going forward, bytecount
should probably fix this on their end, but for now, we take a small
performance hit on big-endian machines.

Fixes #1144
2019-02-09 16:13:07 -05:00
John Schmidt
f99b991117 ignore/types: add zig
PR #1191
2019-02-08 08:12:40 -05:00
Andrew Gallant
de0bc78982 deps: bump encoding_rs to 0.8.16
This brings in an updated `encoding_rs` crate that uses `packed_simd`,
which compiles on the latest nightly. Compilation times do appear to be
impacted significantly though.

Fixes #1175 (again)
2019-02-07 17:05:14 -05:00
Steffen Banhardt
147e96914c ignore/types: *.dtx and *.ins added for tex
PR #1182
2019-01-31 09:06:19 -05:00
Andrew Gallant
0abc40c23c readme: bump MSRV
We bumped it a while back in the CI configuration, but didn't update the
README.
2019-01-29 13:10:43 -05:00
Andrew Gallant
f768796e4f deps: update other deps 2019-01-29 13:08:56 -05:00
Andrew Gallant
da0c0c4705 deps: update to crossbeam-channel 0.3.8
This drops dependencies on parking_lot and rand from ripgrep.

(rand is still used for tests.)
2019-01-29 13:07:37 -05:00
Andrew Gallant
05411b2b32 deprecated: remove use of ATOMIC_BOOL_INIT
Our MSRV is high enough that we can use const functions now.
2019-01-29 13:05:16 -05:00
Andrew Gallant
cc93db3b18 cargo: include auto-generated message
This is going to be annoying for a while if one switches between the
latest nightly compiler and older compilers. Sigh.
2019-01-29 13:04:40 -05:00
Alex Macleod
049354b766 readme: remove EOL Fedora install instructions
Fedora 27 and below are past their EOL, so it can now be said that it's
supported regularly on Fedora.

PR #1177
2019-01-28 08:15:36 -05:00
Andrew Gallant
386dd2806d changelog: BUG #916
This was fixed by bumping the MSRV above Rust 1.28.

Fixes #916
2019-01-27 13:15:17 -05:00
Andrew Gallant
5fe9a954e6 changelog: BUG #1154 2019-01-27 13:05:50 -05:00
Andrew Gallant
f158a42a71 ignore: correctly detect hidden files on Windows
This commit fixes a bug where ripgrep only treated files beginning with
a `.` as hidden. On Windows, we continue this tradition, but
additionally check whether a file has the special Windows "hidden"
attribute set. If so, we treat it as a hidden file.

In order to make this work without an additional stat call, we had to
rearrange some of the plumbing from the directory traverser.

Fixes #1154
2019-01-27 12:11:52 -05:00
Andrew Gallant
5724391d39 doc: small updates to the FAQ and GUIDE
Notably, ripgrep can do multiline search now. We also update the
supported compression format list and replace deprecated flags like
`--sort-files` with `--sort path`.
2019-01-26 16:19:09 -05:00
Andrew Gallant
0df71240ff search: fix -F and -f interaction bug
This fixes what appears to be a pretty egregious regression where the
`-F/--fixed-strings` flag wasn't be applied to patterns supplied via
the `-f/--file` flag. The same bug existed for the `-x/--line-regexp`
flag as well, which we fix here.

Fixes #1176
2019-01-26 16:01:52 -05:00
Andrew Gallant
f3164f2615 exit: tweak exit status logic
This changes how ripgrep emit exit status codes. In particular, any error
that occurs while searching will now cause ripgrep to emit a `2` exit
code, where as it previously would emit either a `0` or a `1` code based
on whether it matched or not. That is, ripgrep would only emit a `2` exit
code for a catastrophic error.

This tweak includes additional logic that GNU grep adheres to, which seems
like good sense. Namely, if -q/--quiet is given, and an error occurs and
a match occurs, then ripgrep will emit a `0` exit code.

Closes #1159
2019-01-26 15:44:49 -05:00
Andrew Gallant
31d3e24130 args: prevent panicking in 'rg -h | rg'
Previously, we relied on clap to handle printing either an error
message, or --help/--version output, in addition to setting the exit
status code. Unfortunately, for --help/--version output, clap was
panicking if the write failed, which can happen in fairly common
scenarios via a broken pipe error. e.g., `rg -h | head`.

We fix this by using clap's "safe" API and doing the printing ourselves.
We also set the exit code to `2` when an invalid command has been given.

Fixes #1125 and partially addresses #1159
2019-01-26 14:39:40 -05:00
Andrew Gallant
bf842dbc7f doc: add note about inverted flags
Fixes #1091
2019-01-26 14:13:06 -05:00
Andrew Gallant
6d5dba85bd doc: clarify automatic encoding detection
Fixes #1103
2019-01-26 13:55:47 -05:00
Andrew Gallant
afb89bcdad fmt: shorten --ignore-file-case-insensitive description 2019-01-26 13:45:02 -05:00
Andrew Gallant
332dc56372 changelog: BUG #1095 2019-01-26 13:40:59 -05:00
Andrew Gallant
12a6ca45f9 config: add --no-ignore-dot flag
This flag causes ripgrep to ignore `.ignore` files.

Closes #1138
2019-01-26 13:40:12 -05:00
Andrew Gallant
9d703110cf regex: make CRLF hack more robust
This commit improves the CRLF hack to be more robust. In particular, in
addition to rewriting `$` as `(?:\r??$)`, we now strip `\r` from the end
of a match if and only if the regex has an ending line anchor required for
a match. This doesn't quite make the hack 100% correct, but should fix most
use cases in practice. An example of a regex that will still be incorrect
is `foo|bar$`, since the analysis isn't quite sophisticated enough to
determine that a `\r` can be safely stripped from any match. Even if we
fix that, regexes like `foo\r|bar$` still won't be handled correctly. Alas,
more work on this front should really be focused on enabling this in the
regex engine itself.

The specific cause of this bug was that grep-searcher was sneakily
stripping CRLF from matching lines when it really shouldn't have. We remove
that code now, and instead rely on better match semantics provided at a
lower level.

Fixes #1095
2019-01-26 12:34:28 -05:00
Andrew Gallant
e99b6bda0e deps: bump regex-syntax to 0.6.5
This is necessary for the use of the new is_line_anchored_{start,end}
APIs.
2019-01-26 12:20:02 -05:00
Andrew Gallant
276e2c9b9a searcher: always strip BOM
This fixes a bug where a BOM prefix was included. While this was somewhat
intentional in order to have a faithful "UTF8 passthru" option, in
practice, this causes problems such as breaking patterns like `^` in a
really non-obvious way.

The actual fix was to add a new API to encoding_rs_io, which this commit
brings in.

Fixes #1163
2019-01-25 17:18:57 -05:00
Andrew Gallant
9a9f54d44c readme: encoding_rs's SIMD support is broken
Add a note about it to the README.

Also, remove mention of the avx-accel feature since it no longer exists.
(bytecount now uses runtime detection to enable SIMD support.)

Fixes #1175
2019-01-24 07:00:53 -05:00
Andrew Gallant
47833b9ce7 deps: update removal of grep devdeps 2019-01-23 20:14:37 -05:00
Awad Mackie
44a9e37737 ignore/types: add method for retrieving file type definition
Fixes #1116, Closes #1120
2019-01-23 20:08:48 -05:00
Andrew Gallant
8fd05cacee changelog: BUG #1121 2019-01-23 20:06:01 -05:00
Rob Lourens
4691d11034 ripgrep: don't skip stdout in --files mode
Specifically, this avoids triggering Windows antimalware when in --files mode.

See also #600.

Fixes #1121
2019-01-23 20:04:44 -05:00
Andrew Gallant
519a6b68af grep: remove unused dependencies
We remove these for now, but we'll eventually add them back once the
examples get more fleshed out.

Closes #1043
2019-01-23 20:01:32 -05:00
Andrew Gallant
9c940b45f4 globset: permit ** to appear anywhere
Previously, `man gitignore` specified that `**` was invalid unless it
was used in one of a few specific circumstances, i.e., `**`, `a/**`,
`**/b` or `a/**/b`. That is, `**` always had to be surrounded by either
a path separator or the beginning/end of the pattern.

It turns out that git itself has treated `**` outside the above contexts
as valid for quite a while, so there was an inconsistency between the
spec `man gitignore` and the implementation, and it wasn't clear which
was actually correct.

@okdana filed a bug against git[1] and got this fixed. The spec was wrong,
which has now been fixed [2] and updated[2].

This commit brings ripgrep in line with git and treats `**` outside of
the above contexts as two consecutive `*` patterns. We deprecate the
`InvalidRecursive` error since it is no longer used.

Fixes #373, Fixes #1098

[1] - https://public-inbox.org/git/C16A9F17-0375-42F9-90A9-A92C9F3D8BBA@dana.is
[2] - 627186d020
[3] - https://git-scm.com/docs/gitignore
2019-01-23 19:59:39 -05:00
Andrew Gallant
0a167021c3 changelog: BUG #1174 2019-01-23 19:19:26 -05:00
Andrew Gallant
aeaa5fc1b1 globset: fix repeated use of **
This fixes a bug where repeated use of ** didn't behave as it should. In
particular, each use of `**` added a new requirement directory depth
requirement. For example, something like `**/**/b` would match
`foo/bar/b`, but it wouldn't match `foo/b` even though it should. In
particular, `**` semantics demand "infinite" depth, so repeated uses of
`**` should just coalesce as if only one was given.

We do this coalescing in the parser. It's a little tricky because we
treat `**/a`, `a/**` and `a/**/b` as distinct tokens with their own
regex conversions. We also test the crap out of it.

Fixes #1174
2019-01-23 19:15:02 -05:00
Andrew Gallant
7048a06c31 changelog: BUG #1173 2019-01-23 18:14:16 -05:00
Andrew Gallant
23be3cf850 ignore: fix handling of **
When deciding whether to add the `**/` prefix or not, we should choose
not to add it if the pattern is simply a bare `**`. Previously, we were
only not adding it if it was `**/`, which is correct, but we also need
to do it for `**` since `**` can already match anywhere.

There's likely a more principled solution to this, but this works for
now.

Fixes #1173
2019-01-23 18:12:35 -05:00
Andrew Gallant
b48bbf527d changelog: PR #1093 2019-01-23 17:56:18 -05:00
dana
8eabe47b57 ignore: always use literal_separator for gitignore patterns (#1093)
PR #1093
2019-01-23 17:54:28 -05:00
Michele Bologna
ff712bfd9d readme: add instructions for openSUSE 15.0
PR #1088
2019-01-22 21:46:11 -05:00
Mika Dede
a7f2d48234 printer: fix path handling in summarizer
This commit fixes a bug where both of the following commands always
reported an error:

    rg --files-with-matches foo file
    rg --files-without-match foo file

In particular, the printer was erroneously respecting the `path` option
even the the summary kind was `PathWithMatch` or `PathWithoutMatch`. The
documented behavior is that those summary kinds always require a path,
and thus, the `path` option has no effect. We fix this by correcting the
case analysis.

This also fixes a bug where the exit code for `--files-without-match`
was not set correctly. We update the printer's `has_match` method to
report the correct value.

Fixes #1106, Closes #1130
2019-01-22 21:37:23 -05:00
Andrew Gallant
57500ad013 changelog: brotli/zstd addition 2019-01-22 20:57:28 -05:00
dana
0b04553aff grep-cli: support Brotli/Zstd decompression
Fixes #1099
2019-01-22 20:56:16 -05:00
dana
1ae121122f ignore/types: add/update brotli, bzip2, gzip, xz, zstd 2019-01-22 20:56:16 -05:00
Andrew Gallant
688003e51c ripgrep: ban rustfmt 2019-01-22 20:07:26 -05:00
David Torosyan
718a00f6f2 ripgrep: add --ignore-file-case-insensitive
The --ignore-file-case-insensitive flag causes all
.gitignore/.rgignore/.ignore files to have their globs matched without
regard for case. Because this introduces a potentially significant
performance regression, this is always disabled by default. Users that
need case insensitive matching can enable it on a case by case basis.

Closes #1164, Closes #1170
2019-01-22 20:03:59 -05:00
Andrew Gallant
7cbc535d70 edition: fix build.rs 2019-01-19 10:46:57 -05:00
Andrew Gallant
7a6a40bae1 edition: move core ripgrep to Rust 2018 2019-01-19 10:44:30 -05:00
Andrew Gallant
1e9ee2cc85 deps: update memmap 2019-01-19 10:44:30 -05:00
Andrew Gallant
968491f8e9 deps: update to bytecount 0.5
bytecount now uses runtime dispatch for enabling SIMD, which means we can
no longer need the avx-accel features. We remove it from ripgrep since the
next release will be a minor version bump, but leave them as no-ops for
the crates that previously used it.
2019-01-19 10:44:30 -05:00
Andrew Gallant
63b0f31a22 deps: update various dependencies
We also increase the MSRV to 1.32, the current stable release, which sets
the stage for migrating to Rust 2018.
2019-01-19 10:44:30 -05:00
P M
7ecee299a5 ignore/types: add QML
PR #1165
2019-01-18 06:48:47 -05:00
David Håsäther
dd396ff34e doc: fix typo
PR #1161
2019-01-14 06:50:30 -05:00
Andrew Gallant
fb0a82f3c3 grep-printer: add macro docs, redux 2019-01-11 09:18:09 -05:00
Andrew Gallant
dbc8ca9cc1 grep-searcher: add docs for assert_eq_printed
Looks like the deny(missing_docs) lint got a bit stronger.
2019-01-11 09:03:00 -05:00
Marco Hinz
c3db8db93d doc: fix typo 2019-01-05 11:18:05 -05:00
Andrew Gallant
17ef4c40f3 ignore-0.4.6 2018-12-30 08:46:09 -05:00
Andrew Gallant
a9e0477ea8 ignore: permit use of deprecated trim_right 2018-12-30 08:44:59 -05:00
Andrew Gallant
b3c5773266 deps: bump ignore 2018-12-30 08:43:18 -05:00
Andrew Gallant
118b950085 ignore-0.4.5 2018-12-15 08:44:10 -05:00
Andrew Gallant
b45b2f58ea deps: update most other dependencies
This commit is the result of doing:

  $ cargo update
  $ cargo update -p encoding_rs --precise 0.8.10

where the latter line prevents encoding_rs from updating to 0.8.11 (or
newer). In particular, the 0.8.11 release increased the minimum Rust
version to 1.29, where as ripgrep 0.10.x is still on 1.28. We stay on an
older version for now until ripgrep is ready to move to 0.11.x.
2018-12-15 08:42:14 -05:00
Andrew Gallant
662a9bc73d deps: update to crossbeam-channel 0.3
This also requires corresponding updates to both rand and rand_core. Doing
an update of rand without doing an update of rand_core results in
compilation errors because two distinct versions of rand_core are included
in the build, and the traits they expose are distinct and incompatible.

We also switch over to using tempfile instead of tempdir, which drops the
last remaining thing keeping rand 0.4 in the build.

Fixes #1141, Fixes #1142
2018-12-15 08:40:04 -05:00
Andrew Gallant
401add0a99 deps: update regex and regex-syntax
This brings in some new Unicode properties, such as \p{Emoji}.

It is now also technically possible construct a regex that recognizes
grapheme clusters.
2018-12-09 16:33:37 -05:00
Simon Morgan
f81b72721b ignore/types: add ASP
PR #1134
2018-12-07 16:19:33 -05:00
Antony Lee
1d4fccaadc ignore/types: add postscript
Although postscript/encapsulated postscript is usually thought of as a
binary format, it's actually mostly ASCII, so ripgrep will not ignore
these files.

The situation is basically the same as for pdf, which is also already
present in the list of known filetypes.

PR #1118
2018-11-23 09:46:11 -05:00
Matteo Bertini
09e464e674 ignore/types: add more Cython file types
From the [Cython file types](https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html?highlight=pxi#cython-file-types) paragraph on the official docs:

> There are three file types in Cython:
>    The implementation files, carrying a .py or .pyx suffix.
>    The definition files, carrying a .pxd suffix.
>    The include files, carrying a .pxi suffix.

PR #1113
2018-11-19 07:37:00 -05:00
Jon Parise
31adff6f3c ignore/types: add Apache Thrift
PR #1102
2018-11-07 07:42:13 -05:00
Andrew Gallant
b41e596327 doc: escape braces in AsciiDoc
This commit fixes a bug where AsciiDoc would drop any line containing a
'{foo}' because it interpreted it as an undefined attribute reference:

> Simple attribute references take the form {<name>}. If the attribute name
> is defined its text value is substituted otherwise the line containing the
> reference is dropped from the output.

See: https://www.methods.co.nz/asciidoc/chunked/ch30.html

We fix this by simply replacing all occurrences of '{' and '}' with
their escaped forms: '&#123;' and '&#125;'.

Fixes #1101
2018-11-06 06:57:16 -05:00
Andrew Gallant
fb62266620 deps: update encoding_rs
This commit bumps the version of encoding_rs to use the latest release.
This appears to fix a panic in UTF-16 decoding.

Fixes #1089
2018-10-22 06:50:35 -04:00
Dave Lee
acf226c39d ignore/types: add BUILD.bazel to bazel file type
PR #1074
2018-10-02 18:00:04 -04:00
Mathieu Bridon
8299625e48 ignore/types: add buildstream
BuildStream is a Free Software tool for building/integrating software stacks.: https://buildstream.gitlab.io/buildstream/

It uses recipes written in YAML, in files with the `.bst` extension.

PR #1071
2018-09-28 08:32:24 -04:00
Andrew Gallant
db256c87eb ripgrep: suggest -U/--multiline
When a "\n literal is not allowed" error is reported, ripgrep will now
suggest the use of the -U/--multiline flag, which enables matching
newlines.

Fixes #1055
2018-09-25 16:56:04 -04:00
Andrew Gallant
ba533f390e grep-searcher: update to encoding_rs_io 0.1.3
This update includes a work-around for a presumed bug in encoding_rs
that causes a panic:
https://github.com/hsivonen/encoding_rs/issues/34

Specifically, to reproduce this in ripgrep, one can run the following:

    $ curl -LO https://cache.ruby-lang.org/pub/ruby/2.5/ruby-2.5.1.tar.gz
    $ tar xf ruby-2.5.1.tar.gz
    $ rg ZZZZZ ruby-2.5.1/test/rexml/data/t63-2.svg
    thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1'

Fixes #1052
2018-09-25 16:56:04 -04:00
Andrew Gallant
ba503eb677 grep-regex: fix inner literal detection
It seems the inner literal detector fails spectacularly in cases of
concatenations that involve groups. The issue here is that if the prefix
of a group inside a concatenation can match the empty string, then any
literals generated to that point in the concatenation need to be cut
such that they are never extended. The detector isn't really built to
handle this case, so we just act conservative cut literals whenever we
see a sub-group. This may make some regexes slower, but the inner
literal detector already misses plenty of cases.

Literal detection (including in the regex engine) is a key component
that needs to be completely rethought at some point.

Fixes #1064
2018-09-25 16:56:04 -04:00
Andrew Gallant
f72c2dfd90 readme: touch up README
Make the wording consistent.
2018-09-14 11:33:56 -04:00
Sylvestre Ledru
c0aa58b4f7 Ripgrep is also available in Ubuntu (from Cosmic) 2018-09-14 08:41:05 +02:00
ykgmfq
184ee4c328 deb: add section info
Put it in the same section as
https://packages.debian.org/stretch/grep

PR #1051
2018-09-13 08:17:24 -04:00
Gabe Berke-Williams
e82fbf2c46 doc: fix typo
"cretion" -> "creation"

PR #1045
2018-09-10 06:49:48 -04:00
Andrew Gallant
eb18da0450 pcre2: use jit_if_available
This will allow PCRE2 to fall back to non-JIT matching when running on
platforms without JIT support.

ref https://github.com/BurntSushi/rust-pcre2/issues/3
2018-09-08 17:12:14 -04:00
Andrew Gallant
0f7494216f readme: update dpkg version 2018-09-08 10:46:40 -04:00
Andrew Chin
442a278635 readme: fancy regexes are not supported by default
PR #1042
2018-09-07 17:43:24 -04:00
Andrew Gallant
7ebed3ace6 pkg: update brew tap to 0.10.0 2018-09-07 14:43:59 -04:00
Andrew Gallant
8a7db1a918 ci: tweak deployment conditions 2018-09-07 14:07:52 -04:00
Andrew Gallant
ce80d794c0 changelog: add release date 2018-09-07 14:00:23 -04:00
Andrew Gallant
c5d467a2ab ci: always force PCRE2 static builds for releases 2018-09-07 14:00:23 -04:00
Andrew Gallant
a62cd553c2 ci: clean up appveyor
Remove some outdated comments and unused config. Also, make the regex for
matching tags a bit more specific.
2018-09-07 14:00:22 -04:00
Andrew Gallant
ce5188335b ci: remove 'branch' condition for deployment
Travis docs[1] say this is ignore when 'tags' is used.

[1] - https://docs.travis-ci.com/user/deployment/#conditional-releases-with-on
2018-09-07 14:00:22 -04:00
Andrew Gallant
b7a456ae83 deb: add completions
This commit adds Bash, zsh and fish completions to the Debian binary
package.

Fixes #1032
2018-09-07 14:00:22 -04:00
Andrew Gallant
d14f0b37d6 deps: update versions for all crates
I don't think every change here is needed, but this ensures we're using
the latest version of every direct dependency.
2018-09-07 14:00:22 -04:00
Andrew Gallant
3ddc3c040f deps: minor updates 2018-09-07 13:03:01 -04:00
Andrew Gallant
eeaa42ecaf scripts: add copy-examples
This is a preliminary script to copy example code from a Markdown file
into a crate's example directory.

This is intended to be used for the upcoming libripgrep guide, but we
don't commit any examples yet.
2018-09-07 12:27:48 -04:00
Andrew Gallant
3797a2a5cb simplegrep: touch up 2018-09-07 12:24:50 -04:00
Andrew Gallant
0e2f8f7b47 grep: add clap and regex dev dependencies to grep
These are (or will be) used in grep's examples.
2018-09-07 12:06:05 -04:00
Andrew Gallant
3dd4b77dfb grep-searcher: add Box<...> impl for Sink
We initially did not have this impl because the first revision of the Sink
trait was much more complicated. In particular, each method was
parameterized over a Matcher. But not every Sink impl actually needs a
Matcher, and it is just as easy to borrow a Matcher explicitly, so the
added parameterization wasn't holding its own.

This does permit Sink implementations to be used as trait objects. One
key use case here is to reduce compile times, since there is quite a bit
of code inside grep-searcher that is parameterized on Sink. Unfortunately,
that code is *also* parameterized on Matcher, and the various printers in
grep-printer are also parameterized on Matcher, which means Sink trait
objects are necessary but no sufficient for a major reduction in compile
times. Unfortunately, the path to making Matcher object safe isn't quite
clear. Extension traits maybe? There's also stuff in the Serde ecosystem
that might help, but the type shenanigans can get pretty gnarly.
2018-09-07 12:06:05 -04:00
Andrew Gallant
3b5cdea862 doc: minor touchups to API docs 2018-09-07 12:06:05 -04:00
Andrew Gallant
54b3e9eb10 grep-printer: delete unused code 2018-09-07 12:06:05 -04:00
Andrew Gallant
56e8864426 grep-matcher: add LineTerminator::is_suffix
This centralizes the logic for checking whether a line has a line
terminator or not.
2018-09-07 12:06:04 -04:00
Andrew Gallant
b8f619d16e readme: a few clarifications 2018-09-07 12:06:04 -04:00
Andrew Gallant
83dff33326 deps: update various deps 2018-09-04 23:29:22 -04:00
Andrew Gallant
003c3695f4 deps: update grep version 2018-09-04 23:29:05 -04:00
Andrew Gallant
10777c150d grep-0.2.1 2018-09-04 23:25:39 -04:00
Andrew Gallant
827179250b changelog: assign feature id 2018-09-04 23:24:22 -04:00
Andrew Gallant
fd22cd520b windows: fix unused warnings on Windows 2018-09-04 23:18:55 -04:00
Andrew Gallant
241bc8f8fc ripgrep: add --pre-glob flag
The --pre-glob flag is like the --glob flag, except it applies to filtering
files through the preprocessor instead of for search. This makes it
possible to apply the preprocessor to only a small subset of files, which
can greatly reduce the process overhead of using a preprocessor when
searching large directories.
2018-09-04 23:18:55 -04:00
Andrew Gallant
b6e30124e0 ripgrep: add --line-buffered and --block-buffered
These flags provide granular control over ripgrep's buffering strategy.
The --line-buffered flag can be genuinely useful in certain types of shell
pipelines. The --block-buffered flag has a murkier use case, but we add it
for completeness.
2018-09-04 23:18:55 -04:00
Andrew Gallant
4846d63539 grep-cli: introduce new grep-cli crate
This commit moves a lot of "utility" code from ripgrep core into
grep-cli. Any one of these things might not be worth creating a new
crate, but combining everything together results in a fair number of a
convenience routines that make up a decent sized crate.

There is potentially more we could move into the crate, but much of what
remains in ripgrep core is almost entirely dealing with the number of
flags we support.

In the course of doing moving things to the grep-cli crate, we clean up
a lot of gunk and improve failure modes in a number of cases. In
particular, we've fixed a bug where other processes could deadlock if
they write too much to stderr.

Fixes #990
2018-09-04 23:18:55 -04:00
helloer
13c47530a6 ignore/types: add pascal type
PR #1036
2018-09-03 07:25:07 -04:00
Jakub Wilk
328f4369e6 doc: fix typos 2018-08-31 11:59:28 -04:00
Andrew Gallant
04518e32e7 deps: update other crates 2018-08-30 23:03:07 -04:00
Andrew Gallant
f2eaf5b977 deps: update termcolor for perf tweaks 2018-08-30 22:57:01 -04:00
Andrew Gallant
3edeeca6e9 changelog: fix typo 2018-08-29 18:46:34 -04:00
Andrew Gallant
c41b353009 changelog: update
This brings the changelog up to date with HEAD and rewords a few things.
2018-08-29 18:25:08 -04:00
Aaron Power
d18839f3dc ignore: add into_path for DirEntry (#1031)
This commit adds ignore::DirEntry::into_path to match
the corresponding method on walkdir::DirEntry.
2018-08-28 18:27:34 -04:00
Andrew Gallant
8f978a3cf7 doc: clarify and fix typo
Clarify that --byte-offset may be wrong if the source isn't being read
directly.

Also tweak the README a bit. And remove a damned Oxford comma.
2018-08-27 21:21:37 -04:00
Andrew Gallant
87b745454d ripgrep: use 'ignore' for skipping stdout
This removes ripgrep-specific code for filtering files that correspond to
stdout and instead uses the 'ignore' crate's functionality for doing the
same.
2018-08-27 21:18:53 -04:00
Andrew Gallant
e5bb750995 ignore: add 'stdout' skipping to the walker
This commit adds a new 'skip_stdout' option to the directory walker. When
enabled, it will skip yielding any directory entries that are believed to
correspond to stdout for the current process. This is useful for filtering
out 'results' in a command like 'grep -r foo > results' in order to avoid
an unbounded feedback mechanism.
2018-08-27 21:18:53 -04:00
dana
d599f0b3c7 complete: don't complete bare pattern after -f 2018-08-27 07:56:40 -04:00
Andrew Gallant
40e310a9f9 ripgrep: add --sort and --sortr flags
These flags each accept one of five choices: none, path, modified,
accessed or created. The value indicates how the results are sorted.
For --sort, results are sorted in ascending order where as for --sortr,
results are sorted in descending order.

Closes #404
2018-08-26 18:42:25 -04:00
Andrew Gallant
510f15f4da ignore: add sort_by_file_path builder method
This permits callers to sort entries by their full file path, which makes
it easy to query for various file statistics.

It would have been better to provide a comparator on DirEntry itself,
similar to how walkdir does it, but this seems to require quite a bit of
work to make the types work out, assuming we want to continue to use
walkdir's sorting support (we do).
2018-08-26 18:42:25 -04:00
Andrew Gallant
f9ce7a84a8 ignore: add 'same_file_system' option
This commit adds a 'same_file_system' option to the walk builder. For
single threaded walking, it defers to the walkdir crate, which has the
same option. The bulk of this commit implements this flag for the parallel
walker. We add one very feeble test for this.

The parallel walker is now officially a complete mess.

Closes #321
2018-08-26 18:42:25 -04:00
Andrew Gallant
1b6089674e deps: more updates 2018-08-26 18:42:25 -04:00
Andrew Gallant
05a0389555 ripgrep: use winapi-util for stdin_is_readable 2018-08-25 00:30:15 -04:00
Andrew Gallant
16353bad6e deps: update various deps
This includes a new crate, winapi-util, that is now used in wincolor,
walkdir and same-file.
2018-08-25 00:19:40 -04:00
Tim Kilbourn
fe442de091 changelog: fix typo
Fuchsia is a pain to spell.

PR #1026
2018-08-23 13:17:27 -04:00
Andrew Gallant
1bb8b7170f doc: clarify use of SIMD features
You need a nightly compiler.

Ref #188
2018-08-23 09:56:37 -04:00
Andrew Gallant
55ed698a98 deps: update walkdir minimum version
We'll want to be using the new `same_file_system` option soon.
2018-08-23 09:54:45 -04:00
Andrew Gallant
f1e025873f deps: update dependencies
This includes an update to walkdir 2.2.2, which includes a
`same_file_system` option.
2018-08-22 20:50:24 -04:00
Andrew Gallant
033ad2b8e4 deps: update clap
Update clap to the latest version.

Also, drop the ansi_term dependency by disabling color output in clap's
error messages.
2018-08-21 23:10:34 -04:00
Andrew Gallant
098a8ee843 deps: various patch upgrades 2018-08-21 23:05:52 -04:00
Andrew Gallant
2f3dbf5fee ignore: fix false positive in path_is_symlink
This commit fixes a bug where the first path always reported itself as
as symlink via `path_is_symlink`.

Part of this fix includes updating walkdir to 2.2.1, which also includes
a corresponding bug fix.

Fixes #984
2018-08-21 23:05:52 -04:00
Andrew Gallant
5c80e4adb6 release: better support for binary Debian package
This commit beefs up the package metadata used by the 'cargo deb' tool to
produce a binary dpkg. In particular, we now include ripgrep's man page.

This commit includes a new script, 'ci/build_deb.sh', which will handle
the build process for a dpkg, which has become a bit more nuanced than
just running 'cargo deb'. We don't (yet) run this script in CI.

Fixes #842
2018-08-21 23:05:52 -04:00
Andrew Gallant
fcd1853031 doc: update ripgrep's description
This now mentions PCRE2 support.
2018-08-21 23:05:52 -04:00
Andrew Gallant
74a89be641 grep-printer: fix bug in printing truncated lines
When emitting color, the printer wasn't checking whether the line
exceeded the maximum allowed length.
2018-08-21 23:05:52 -04:00
Andrew Gallant
5b1ce8bdc2 tests: touch up tests on Windows
This fixes warnings and adds an additional invalid UTF-8 test that will
run on Windows.
2018-08-21 23:05:52 -04:00
Andrew Gallant
1529ce3341 ripgrep: remove workaround for std bug
This commit undoes a work-around for a bug in Rust's standard library
that prevented correct file type detection on Windows in OneDrive
directories. We remove the work-around because we are moving to a
latest-stable Rust version policy, which has included this fix for a while
now.

ref #705, https://github.com/rust-lang/rust/issues/46484
2018-08-21 23:05:52 -04:00
Andrew Gallant
95a4f15916 ignore: clarify docs for DirEntry::error
Fixes #953
2018-08-21 23:05:52 -04:00
Andrew Gallant
0eef05142a ripgrep: move minimum version to Rust stable
This also updates some code to make use of our more liberal versioning
requirement, including the use of crossbeam-channel instead of the MsQueue
from the older an unmaintained crossbeam 0.3. This does regrettably add
a sizable number of dependencies, however, compile times seem mostly
unaffected.

Closes #1019
2018-08-21 23:05:52 -04:00
Andrew Gallant
edd6eb4e06 ripgrep: make --no-pcre2-unicode the canonical flag
Previously, we used --pcre2-unicode as the canonical flag despite the
fact that it is enabled by default, which is inconsistent with how we
handle other similar flags.

The reason why --pcre2-unicode was made the canonical flag was to make
it easier to discover since it would be sorted near the --pcre2 flag. To
solve that problem, we simply start a convention that lists related
flags in the docs.

Fixes #1022
2018-08-21 23:05:52 -04:00
Andrew Gallant
7ac9782970 doc: fix typo 2018-08-20 18:00:14 -04:00
Andrew Gallant
180054d7dc doc: caveats 2018-08-20 17:58:29 -04:00
101 changed files with 7892 additions and 2430 deletions

View File

@@ -1,4 +1,5 @@
language: rust
dist: xenial
env:
global:
- PROJECT_NAME: ripgrep
@@ -62,13 +63,13 @@ matrix:
# Minimum Rust supported channel. We enable these to make sure ripgrep
# continues to work on the advertised minimum Rust version.
- os: linux
rust: 1.23.0
rust: 1.34.0
env: TARGET=x86_64-unknown-linux-gnu
- os: linux
rust: 1.23.0
rust: 1.34.0
env: TARGET=x86_64-unknown-linux-musl
- os: linux
rust: 1.23.0
rust: 1.34.0
env: TARGET=arm-unknown-linux-gnueabihf GCC_VERSION=4.8
addons:
apt:
@@ -93,7 +94,7 @@ deploy:
skip_cleanup: true
on:
condition: $TRAVIS_RUST_VERSION = nightly
branch: master
branch: master # i guess we do need this after all?
tags: true
api_key:
secure: "IbSnsbGkxSydR/sozOf1/SRvHplzwRUHzcTjM7BKnr7GccL86gRPUrsrvD103KjQUGWIc1TnK1YTq5M0Onswg/ORDjqa1JEJPkPdPnVh9ipbF7M2De/7IlB4X4qXLKoApn8+bx2x/mfYXu4G+G1/2QdbaKK2yfXZKyjz0YFx+6CNrVCT2Nk8q7aHvOOzAL58vsG8iPDpupuhxlMDDn/UhyOWVInmPPQ0iJR1ZUJN8xJwXvKvBbfp3AhaBiAzkhXHNLgBR8QC5noWWMXnuVDMY3k4f3ic0V+p/qGUCN/nhptuceLxKFicMCYObSZeUzE5RAI0/OBW7l3z2iCoc+TbAnn+JrX/ObJCfzgAOXAU3tLaBFMiqQPGFKjKg1ltSYXomOFP/F7zALjpvFp4lYTBajRR+O3dqaxA9UQuRjw27vOeUpMcga4ZzL4VXFHzrxZKBHN//XIGjYAVhJ1NSSeGpeJV5/+jYzzWKfwSagRxQyVCzMooYFFXzn8Yxdm3PJlmp3GaAogNkdB9qKcrEvRINCelalzALPi0hD/HUDi8DD2PNTCLLMo6VSYtvc685Zbe+KgNzDV1YyTrRCUW6JotrS0r2ULLwnsh40hSB//nNv3XmwNmC/CmW5QAnIGj8cBMF4S2t6ohADIndojdAfNiptmaZOIT6owK7bWMgPMyopo="

View File

@@ -1,5 +1,157 @@
0.10.0 (TBD)
============
TBD
===
TODO.
Bug fixes:
* [BUG #1259](https://github.com/BurntSushi/ripgrep/issues/1259):
Fix bug where the last byte of a `-f file` was stripped if it wasn't a `\n`.
* [BUG #1302](https://github.com/BurntSushi/ripgrep/issues/1302):
Show better error messages when a non-existent preprocessor command is given.
11.0.1 (2019-04-16)
===================
ripgrep 11.0.1 is a new patch release that fixes a search regression introduced
in the previous 11.0.0 release. In particular, ripgrep can enter an infinite
loop for some search patterns when searching invalid UTF-8.
Bug fixes:
* [BUG #1247](https://github.com/BurntSushi/ripgrep/issues/1247):
Fix search bug that can cause ripgrep to enter an infinite loop.
11.0.0 (2019-04-15)
===================
ripgrep 11 is a new major version release of ripgrep that contains many bug
fixes, some performance improvements and a few feature enhancements. Notably,
ripgrep's user experience for binary file filtering has been improved. See the
[guide's new section on binary data](GUIDE.md#binary-data) for more details.
This release also marks a change in ripgrep's versioning. Where as the previous
version was `0.10.0`, this version is `11.0.0`. Moving forward, ripgrep's
major version will be increased a few times per year. ripgrep will continue to
be conservative with respect to backwards compatibility, but may occasionally
introduce breaking changes, which will always be documented in this CHANGELOG.
See [issue 1172](https://github.com/BurntSushi/ripgrep/issues/1172) for a bit
more detail on why this versioning change was made.
This release increases the **minimum supported Rust version** from 1.28.0 to
1.34.0.
**BREAKING CHANGES**:
* ripgrep has tweaked its exit status codes to be more like GNU grep's. Namely,
if a non-fatal error occurs during a search, then ripgrep will now always
emit a `2` exit status code, regardless of whether a match is found or not.
Previously, ripgrep would only emit a `2` exit status code for a catastrophic
error (e.g., regex syntax error). One exception to this is if ripgrep is run
with `-q/--quiet`. In that case, if an error occurs and a match is found,
then ripgrep will exit with a `0` exit status code.
* Supplying the `-u/--unrestricted` flag three times is now equivalent to
supplying `--no-ignore --hidden --binary`. Previously, `-uuu` was equivalent
to `--no-ignore --hidden --text`. The difference is that `--binary` disables
binary file filtering without potentially dumping binary data into your
terminal. That is, `rg -uuu foo` should now be equivalent to `grep -r foo`.
* The `avx-accel` feature of ripgrep has been removed since it is no longer
necessary. All uses of AVX in ripgrep are now enabled automatically via
runtime CPU feature detection. The `simd-accel` feature does remain available
(only for enabling SIMD for transcoding), however, it does increase
compilation times substantially at the moment.
Performance improvements:
* [PERF #497](https://github.com/BurntSushi/ripgrep/issues/497),
[PERF #838](https://github.com/BurntSushi/ripgrep/issues/838):
Make `rg -F -f dictionary-of-literals` much faster.
Feature enhancements:
* Added or improved file type filtering for Apache Thrift, ASP, Bazel, Brotli,
BuildStream, bzip2, C, C++, Cython, gzip, Java, Make, Postscript, QML, Tex,
XML, xz, zig and zstd.
* [FEATURE #855](https://github.com/BurntSushi/ripgrep/issues/855):
Add `--binary` flag for disabling binary file filtering.
* [FEATURE #1078](https://github.com/BurntSushi/ripgrep/pull/1078):
Add `--max-columns-preview` flag for showing a preview of long lines.
* [FEATURE #1099](https://github.com/BurntSushi/ripgrep/pull/1099):
Add support for Brotli and Zstd to the `-z/--search-zip` flag.
* [FEATURE #1138](https://github.com/BurntSushi/ripgrep/pull/1138):
Add `--no-ignore-dot` flag for ignoring `.ignore` files.
* [FEATURE #1155](https://github.com/BurntSushi/ripgrep/pull/1155):
Add `--auto-hybrid-regex` flag for automatically falling back to PCRE2.
* [FEATURE #1159](https://github.com/BurntSushi/ripgrep/pull/1159):
ripgrep's exit status logic should now match GNU grep. See updated man page.
* [FEATURE #1164](https://github.com/BurntSushi/ripgrep/pull/1164):
Add `--ignore-file-case-insensitive` for case insensitive ignore globs.
* [FEATURE #1185](https://github.com/BurntSushi/ripgrep/pull/1185):
Add `-I` flag as a short option for the `--no-filename` flag.
* [FEATURE #1207](https://github.com/BurntSushi/ripgrep/pull/1207):
Add `none` value to `-E/--encoding` to forcefully disable all transcoding.
* [FEATURE da9d7204](https://github.com/BurntSushi/ripgrep/commit/da9d7204):
Add `--pcre2-version` for querying showing PCRE2 version information.
Bug fixes:
* [BUG #306](https://github.com/BurntSushi/ripgrep/issues/306),
[BUG #855](https://github.com/BurntSushi/ripgrep/issues/855):
Improve the user experience for ripgrep's binary file filtering.
* [BUG #373](https://github.com/BurntSushi/ripgrep/issues/373),
[BUG #1098](https://github.com/BurntSushi/ripgrep/issues/1098):
`**` is now accepted as valid syntax anywhere in a glob.
* [BUG #916](https://github.com/BurntSushi/ripgrep/issues/916):
ripgrep no longer hangs when searching `/proc` with a zombie process present.
* [BUG #1052](https://github.com/BurntSushi/ripgrep/issues/1052):
Fix bug where ripgrep could panic when transcoding UTF-16 files.
* [BUG #1055](https://github.com/BurntSushi/ripgrep/issues/1055):
Suggest `-U/--multiline` when a pattern contains a `\n`.
* [BUG #1063](https://github.com/BurntSushi/ripgrep/issues/1063):
Always strip a BOM if it's present, even for UTF-8.
* [BUG #1064](https://github.com/BurntSushi/ripgrep/issues/1064):
Fix inner literal detection that could lead to incorrect matches.
* [BUG #1079](https://github.com/BurntSushi/ripgrep/issues/1079):
Fixes a bug where the order of globs could result in missing a match.
* [BUG #1089](https://github.com/BurntSushi/ripgrep/issues/1089):
Fix another bug where ripgrep could panic when transcoding UTF-16 files.
* [BUG #1091](https://github.com/BurntSushi/ripgrep/issues/1091):
Add note about inverted flags to the man page.
* [BUG #1093](https://github.com/BurntSushi/ripgrep/pull/1093):
Fix handling of literal slashes in gitignore patterns.
* [BUG #1095](https://github.com/BurntSushi/ripgrep/issues/1095):
Fix corner cases involving the `--crlf` flag.
* [BUG #1101](https://github.com/BurntSushi/ripgrep/issues/1101):
Fix AsciiDoc escaping for man page output.
* [BUG #1103](https://github.com/BurntSushi/ripgrep/issues/1103):
Clarify what `--encoding auto` does.
* [BUG #1106](https://github.com/BurntSushi/ripgrep/issues/1106):
`--files-with-matches` and `--files-without-match` work with one file.
* [BUG #1121](https://github.com/BurntSushi/ripgrep/issues/1121):
Fix bug that was triggering Windows antimalware when using the `--files`
flag.
* [BUG #1125](https://github.com/BurntSushi/ripgrep/issues/1125),
[BUG #1159](https://github.com/BurntSushi/ripgrep/issues/1159):
ripgrep shouldn't panic for `rg -h | rg` and should emit correct exit status.
* [BUG #1144](https://github.com/BurntSushi/ripgrep/issues/1144):
Fixes a bug where line numbers could be wrong on big-endian machines.
* [BUG #1154](https://github.com/BurntSushi/ripgrep/issues/1154):
Windows files with "hidden" attribute are now treated as hidden.
* [BUG #1173](https://github.com/BurntSushi/ripgrep/issues/1173):
Fix handling of `**` patterns in gitignore files.
* [BUG #1174](https://github.com/BurntSushi/ripgrep/issues/1174):
Fix handling of repeated `**` patterns in gitignore files.
* [BUG #1176](https://github.com/BurntSushi/ripgrep/issues/1176):
Fix bug where `-F`/`-x` weren't applied to patterns given via `-f`.
* [BUG #1189](https://github.com/BurntSushi/ripgrep/issues/1189):
Document cases where ripgrep may use a lot of memory.
* [BUG #1203](https://github.com/BurntSushi/ripgrep/issues/1203):
Fix a matching bug related to the suffix literal optimization.
* [BUG 8f14cb18](https://github.com/BurntSushi/ripgrep/commit/8f14cb18):
Increase the default stack size for PCRE2's JIT.
0.10.0 (2018-09-07)
===================
This is a new minor version release of ripgrep that contains some major new
features, a huge number of bug fixes, and is the first release based on
libripgrep. The entirety of ripgrep's core search and printing code has been
@@ -10,24 +162,32 @@ format.
**BREAKING CHANGES**:
* The minimum version required to compile Rust has now changed to track the
latest stable version of Rust. Patch releases will continue to compile with
the same version of Rust as the previous patch release, but new minor
versions will use the current stable version of the Rust compile as its
minimum supported version.
* The match semantics of `-w/--word-regexp` have changed slightly. They used
to be `\b(?:<your pattern>)\b`, but now it's
`(?:^|\W)(?:<your pattern>)(?:$|\W)`.
See [#389](https://github.com/BurntSushi/ripgrep/issues/389) for more
details.
`(?:^|\W)(?:<your pattern>)(?:$|\W)`. This matches the behavior of GNU grep
and is believed to be closer to the intended semantics of the flag. See
[#389](https://github.com/BurntSushi/ripgrep/issues/389) for more details.
Feature enhancements:
* [FEATURE #162](https://github.com/BurntSushi/ripgrep/issues/162):
libripgrep is now a thing, composed of the following crates:
`grep`, `grep-matcher`, `grep-pcre2`, `grep-printer`, `grep-regex` and
`grep-searcher`.
libripgrep is now a thing. The primary crate is
[`grep`](https://docs.rs/grep).
* [FEATURE #176](https://github.com/BurntSushi/ripgrep/issues/176):
Add `-U/--multiline` flag that permits matching over multiple lines.
* [FEATURE #188](https://github.com/BurntSushi/ripgrep/issues/188):
Add `-P/--pcre2` flag that gives support for look-around and backreferences.
* [FEATURE #244](https://github.com/BurntSushi/ripgrep/issues/244):
Add `--json` flag that prints results in a JSON Lines format.
* [FEATURE #321](https://github.com/BurntSushi/ripgrep/issues/321):
Add `--one-file-system` flag to skip directories on different file systems.
* [FEATURE #404](https://github.com/BurntSushi/ripgrep/issues/404):
Add `--sort` and `--sortr` flag for more sorting. Deprecate `--sort-files`.
* [FEATURE #416](https://github.com/BurntSushi/ripgrep/issues/416):
Add `--crlf` flag to permit `$` to work with carriage returns on Windows.
* [FEATURE #917](https://github.com/BurntSushi/ripgrep/issues/917):
@@ -36,6 +196,10 @@ Feature enhancements:
Add `--null-data` flag, which makes ripgrep use NUL as a line terminator.
* [FEATURE #997](https://github.com/BurntSushi/ripgrep/issues/997):
The `--passthru` flag now works with the `--replace` flag.
* [FEATURE #1038-1](https://github.com/BurntSushi/ripgrep/issues/1038):
Add `--line-buffered` and `--block-buffered` for forcing a buffer strategy.
* [FEATURE #1038-2](https://github.com/BurntSushi/ripgrep/issues/1038):
Add `--pre-glob` for filtering files through the `--pre` flag.
Bug fixes:
@@ -53,14 +217,22 @@ Bug fixes:
Matching empty lines now works correctly in several corner cases.
* [BUG #764](https://github.com/BurntSushi/ripgrep/issues/764):
Color escape sequences now coalesce, which reduces output size.
* [BUG #842](https://github.com/BurntSushi/ripgrep/issues/842):
Add man page to binary Debian package.
* [BUG #922](https://github.com/BurntSushi/ripgrep/issues/922):
ripgrep is now more robust with respect to memory maps failing.
* [BUG #937](https://github.com/BurntSushi/ripgrep/issues/937):
Color escape sequences are no longer emitted for empty matches.
* [BUG #940](https://github.com/BurntSushi/ripgrep/issues/940):
Context from the `--passthru` flag should not impact process exit status.
* [BUG #984](https://github.com/BurntSushi/ripgrep/issues/984):
Fixes bug in `ignore` crate where first path was always treated as a symlink.
* [BUG #990](https://github.com/BurntSushi/ripgrep/issues/990):
Read stderr asynchronously when running a process.
* [BUG #1013](https://github.com/BurntSushi/ripgrep/issues/1013):
Add compile time and runtime CPU features to `--version` output.
* [BUG #1028](https://github.com/BurntSushi/ripgrep/pull/1028):
Don't complete bare pattern after `-f` in zsh.
0.9.0 (2018-08-03)
@@ -96,7 +268,7 @@ multi-line search support and a JSON output format.
Feature enhancements:
* Added or improved file type filtering for Android, Bazel, Fuschia, Haskell,
* Added or improved file type filtering for Android, Bazel, Fuchsia, Haskell,
Java and Puppet.
* [FEATURE #411](https://github.com/BurntSushi/ripgrep/issues/411):
Add a `--stats` flag, which emits aggregate statistics after search results.

702
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -1,6 +1,6 @@
[package]
name = "ripgrep"
version = "0.9.0" #:version
version = "11.0.1" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
ripgrep is a line-oriented search tool that recursively searches your current
@@ -17,6 +17,7 @@ license = "Unlicense OR MIT"
exclude = ["HomebrewFormula"]
build = "build.rs"
autotests = false
edition = "2018"
[badges]
travis-ci = { repository = "BurntSushi/ripgrep" }
@@ -35,6 +36,7 @@ path = "tests/tests.rs"
members = [
"globset",
"grep",
"grep-cli",
"grep-matcher",
"grep-pcre2",
"grep-printer",
@@ -44,43 +46,66 @@ members = [
]
[dependencies]
atty = "0.2.11"
globset = { version = "0.4.0", path = "globset" }
grep = { version = "0.2.0", path = "grep" }
ignore = { version = "0.4.0", path = "ignore" }
lazy_static = "1"
log = "0.4"
num_cpus = "1"
regex = "1"
same-file = "1"
serde_json = "1"
termcolor = "1"
bstr = "0.2.0"
grep = { version = "0.2.4", path = "grep" }
ignore = { version = "0.4.7", path = "ignore" }
lazy_static = "1.1.0"
log = "0.4.5"
num_cpus = "1.8.0"
regex = "1.0.5"
serde_json = "1.0.23"
termcolor = "1.0.3"
[dependencies.clap]
version = "2.29.4"
version = "2.32.0"
default-features = false
features = ["suggestions", "color"]
features = ["suggestions"]
[target.'cfg(windows)'.dependencies.winapi]
version = "0.3"
features = ["std", "fileapi", "winnt"]
[target.'cfg(all(target_env = "musl", target_pointer_width = "64"))'.dependencies.jemallocator]
version = "0.3.0"
[build-dependencies]
lazy_static = "1"
lazy_static = "1.1.0"
[build-dependencies.clap]
version = "2.29.4"
version = "2.32.0"
default-features = false
features = ["suggestions", "color"]
features = ["suggestions"]
[dev-dependencies]
serde = "1"
serde_derive = "1"
serde = "1.0.77"
serde_derive = "1.0.77"
[features]
avx-accel = ["grep/avx-accel"]
simd-accel = ["grep/simd-accel"]
pcre2 = ["grep/pcre2"]
[profile.release]
debug = 1
[package.metadata.deb]
features = ["pcre2"]
section = "utils"
assets = [
["target/release/rg", "usr/bin/", "755"],
["COPYING", "usr/share/doc/ripgrep/", "644"],
["LICENSE-MIT", "usr/share/doc/ripgrep/", "644"],
["UNLICENSE", "usr/share/doc/ripgrep/", "644"],
["CHANGELOG.md", "usr/share/doc/ripgrep/CHANGELOG", "644"],
["README.md", "usr/share/doc/ripgrep/README", "644"],
["FAQ.md", "usr/share/doc/ripgrep/FAQ", "644"],
# The man page is automatically generated by ripgrep's build process, so
# this file isn't actually commited. Instead, to create a dpkg, either
# create a deployment/deb directory and copy the man page to it, or use the
# 'ci/build_deb.sh' script.
["deployment/deb/rg.1", "usr/share/man/man1/rg.1", "644"],
# Similarly for shell completions.
["deployment/deb/rg.bash", "usr/share/bash-completion/completions/rg", "644"],
["deployment/deb/rg.fish", "usr/share/fish/completions/rg.fish", "644"],
["deployment/deb/_rg", "usr/share/zsh/vendor-completions/", "644"],
]
extended-description = """\
ripgrep (rg) recursively searches your current directory for a regex pattern.
By default, ripgrep will respect your .gitignore and automatically skip hidden
files/directories and binary files.
"""

27
FAQ.md
View File

@@ -118,7 +118,7 @@ from run to run of ripgrep.
The only way to make the order of results consistent is to ask ripgrep to
sort the output. Currently, this will disable all parallelism. (On smaller
repositories, you might not notice much of a performance difference!) You
can achieve this with the `--sort-files` flag.
can achieve this with the `--sort path` flag.
There is more discussion on this topic here:
https://github.com/BurntSushi/ripgrep/issues/152
@@ -136,10 +136,10 @@ How do I search compressed files?
</h3>
ripgrep's `-z/--search-zip` flag will cause it to search compressed files
automatically. Currently, this supports gzip, bzip2, lzma, lz4 and xz only and
requires the corresponding `gzip`, `bzip2` and `xz` binaries to be installed on
your system. (That is, ripgrep does decompression by shelling out to another
process.)
automatically. Currently, this supports gzip, bzip2, xz, lzma, lz4, Brotli and
Zstd. Each of these requires requires the corresponding `gzip`, `bzip2`, `xz`,
`lz4`, `brotli` and `zstd` binaries to be installed on your system. (That is,
ripgrep does decompression by shelling out to another process.)
ripgrep currently does not search archive formats, so `*.tar.gz` files, for
example, are skipped.
@@ -149,9 +149,8 @@ example, are skipped.
How do I search over multiple lines?
</h3>
This isn't currently possible. ripgrep is fundamentally a line-oriented search
tool. With that said,
[multiline search is a planned opt-in feature](https://github.com/BurntSushi/ripgrep/issues/176).
The `-U/--multiline` flag enables ripgrep to report results that span over
multiple lines.
<h3 name="fancy">
@@ -635,7 +634,7 @@ real 0m1.714s
user 0m1.669s
sys 0m0.044s
[andrew@Cheetah 2016] time rg -P '^\w{42}$' subtitles2016-sample --no-pcre2-unicode
$ time rg -P '^\w{42}$' subtitles2016-sample --no-pcre2-unicode
21225780:EverymajordevelopmentinthehistoryofAmerica
real 0m1.997s
@@ -675,14 +674,18 @@ no longer needs to do any kind of UTF-8 checks. This allows the file to get
memory mapped and passed right through PCRE2's JIT at impressive speeds. (As
a brief and interesting historical note, the configuration of "memory map +
multiline + no-Unicode" is exactly the configuration used by The Silver
Searcher. This analysis perhaps sheds some reasoning as to why it converged on
that specific setting!)
Searcher. This analysis perhaps sheds some reasoning as to why that
configuration is useful!)
In summary, if you want PCRE2 to go as fast as possible and you don't care
about Unicode and you don't care about matches possibly spanning across
multiple lines, then enable multiline mode with `-U` and disable PCRE2's
Unicode support with the `--no-pcre2-unicode` flag.
Caveat emptor: This author is not a PCRE2 expert, so there may be APIs that can
improve performance that the author missed. Similarly, there may be alternative
designs for a searching tool that are more amenable to how PCRE2 works.
<h3 name="rg-other-cmd">
When I run <code>rg</code>, why does it execute some other command?
@@ -932,7 +935,7 @@ for the previous section apply.
* Are you writing portable shell scripts intended to work in a variety of
environments? Great, probably not a good idea to use ripgrep! ripgrep is has
nowhere near the ubquity of grep, so if you do use ripgrep, you might need
nowhere near the ubiquity of grep, so if you do use ripgrep, you might need
to futz with the installation process more than you would with grep.
* Do you care about POSIX compatibility? If so, then you can't use ripgrep
because it never was, isn't and never will be POSIX compatible.

132
GUIDE.md
View File

@@ -18,6 +18,7 @@ translatable to any command line shell environment.
* [Replacements](#replacements)
* [Configuration file](#configuration-file)
* [File encoding](#file-encoding)
* [Binary data](#binary-data)
* [Common options](#common-options)
@@ -109,7 +110,7 @@ colors, you'll notice that `faster` will be highlighted instead of just the
It is beyond the scope of this guide to provide a full tutorial on regular
expressions, but ripgrep's specific syntax is documented here:
https://docs.rs/regex/0.2.5/regex/#syntax
https://docs.rs/regex/*/regex/#syntax
### Recursive search
@@ -227,7 +228,7 @@ with the following contents:
```
ripgrep treats `.ignore` files with higher precedence than `.gitignore` files
(and treats `.rgignore` files with higher precdence than `.ignore` files).
(and treats `.rgignore` files with higher precedence than `.ignore` files).
This means ripgrep will see the `!log/` whitelist rule first and search that
directory.
@@ -235,6 +236,11 @@ Like `.gitignore`, a `.ignore` file can be placed in any directory. Its rules
will be processed with respect to the directory it resides in, just like
`.gitignore`.
To process `.gitignore` and `.ignore` files case insensitively, use the flag
`--ignore-file-case-insensitive`. This is especially useful on case insensitive
file systems like those on Windows and macOS. Note though that this can come
with a significant performance penalty, and is therefore disabled by default.
For a more in depth description of how glob patterns in a `.gitignore` file
are interpreted, please see `man gitignore`.
@@ -520,9 +526,9 @@ config file. Once the environment variable is set, open the file and just type
in the flags you want set automatically. There are only two rules for
describing the format of the config file:
1. Every line is a shell argument, after trimming ASCII whitespace.
2. Lines starting with `#` (optionally preceded by any amount of
ASCII whitespace) are ignored.
1. Every line is a shell argument, after trimming whitespace.
2. Lines starting with `#` (optionally preceded by any amount of whitespace)
are ignored.
In particular, there is no escaping. Each line is given to ripgrep as a single
command line argument verbatim.
@@ -532,8 +538,9 @@ formatting peculiarities:
```
$ cat $HOME/.ripgreprc
# Don't let ripgrep vomit really long lines to my terminal.
# Don't let ripgrep vomit really long lines to my terminal, and show a preview.
--max-columns=150
--max-columns-preview
# Add my 'web' type.
--type-add
@@ -580,7 +587,7 @@ override it.
If you're confused about what configuration file ripgrep is reading arguments
from, then running ripgrep with the `--debug` flag should help clarify things.
The debug output should note what config file is being loaded and the arugments
The debug output should note what config file is being loaded and the arguments
that have been read from the configuration.
Finally, if you want to make absolutely sure that ripgrep *isn't* reading a
@@ -598,13 +605,14 @@ topic, but we can try to summarize its relevancy to ripgrep:
* Files are generally just a bundle of bytes. There is no reliable way to know
their encoding.
* Either the encoding of the pattern must match the encoding of the files being
searched, or a form of transcoding must be performed converts either the
searched, or a form of transcoding must be performed that converts either the
pattern or the file to the same encoding as the other.
* ripgrep tends to work best on plain text files, and among plain text files,
the most popular encodings likely consist of ASCII, latin1 or UTF-8. As
a special exception, UTF-16 is prevalent in Windows environments
In light of the above, here is how ripgrep behaves:
In light of the above, here is how ripgrep behaves when `--encoding auto` is
given, which is the default:
* All input is assumed to be ASCII compatible (which means every byte that
corresponds to an ASCII codepoint actually is an ASCII codepoint). This
@@ -620,12 +628,15 @@ In light of the above, here is how ripgrep behaves:
they correspond to a UTF-16 BOM, then ripgrep will transcode the contents of
the file from UTF-16 to UTF-8, and then execute the search on the transcoded
version of the file. (This incurs a performance penalty since transcoding
is slower than regex searching.)
is slower than regex searching.) If the file contains invalid UTF-16, then
the Unicode replacement codepoint is substituted in place of invalid code
units.
* To handle other cases, ripgrep provides a `-E/--encoding` flag, which permits
you to specify an encoding from the
[Encoding Standard](https://encoding.spec.whatwg.org/#concept-encoding-get).
ripgrep will assume *all* files searched are the encoding specified and
will perform a transcoding step just like in the UTF-16 case described above.
ripgrep will assume *all* files searched are the encoding specified (unless
the file has a BOM) and will perform a transcoding step just like in the
UTF-16 case described above.
By default, ripgrep will not require its input be valid UTF-8. That is, ripgrep
can and will search arbitrary bytes. The key here is that if you're searching
@@ -635,9 +646,26 @@ pattern won't find anything. With all that said, this mode of operation is
important, because it lets you find ASCII or UTF-8 *within* files that are
otherwise arbitrary bytes.
As a special case, the `-E/--encoding` flag supports the value `none`, which
will completely disable all encoding related logic, including BOM sniffing.
When `-E/--encoding` is set to `none`, ripgrep will search the raw bytes of
the underlying file with no transcoding step. For example, here's how you might
search the raw UTF-16 encoding of the string `Шерлок`:
```
$ rg '(?-u)\(\x045\x04@\x04;\x04>\x04:\x04' -E none -a some-utf16-file
```
Of course, that's just an example meant to show how one can drop down into
raw bytes. Namely, the simpler command works as you might expect automatically:
```
$ rg 'Шерлок' some-utf16-file
```
Finally, it is possible to disable ripgrep's Unicode support from within the
pattern regular expression. For example, let's say you wanted `.` to match any
byte rather than any Unicode codepoint. (You might want this while searching a
regular expression. For example, let's say you wanted `.` to match any byte
rather than any Unicode codepoint. (You might want this while searching a
binary file, since `.` by default will not match invalid UTF-8.) You could do
this by disabling Unicode via a regular expression flag:
@@ -654,6 +682,76 @@ $ rg '\w(?-u:\w)\w'
```
### Binary data
In addition to skipping hidden files and files in your `.gitignore` by default,
ripgrep also attempts to skip binary files. ripgrep does this by default
because binary files (like PDFs or images) are typically not things you want to
search when searching for regex matches. Moreover, if content in a binary file
did match, then it's possible for undesirable binary data to be printed to your
terminal and wreak havoc.
Unfortunately, unlike skipping hidden files and respecting your `.gitignore`
rules, a file cannot as easily be classified as binary. In order to figure out
whether a file is binary, the most effective heuristic that balances
correctness with performance is to simply look for `NUL` bytes. At that point,
the determination is simple: a file is considered "binary" if and only if it
contains a `NUL` byte somewhere in its contents.
The issue is that while most binary files will have a `NUL` byte toward the
beginning of its contents, this is not necessarily true. The `NUL` byte might
be the very last byte in a large file, but that file is still considered
binary. While this leads to a fair amount of complexity inside ripgrep's
implementation, it also results in some unintuitive user experiences.
At a high level, ripgrep operates in three different modes with respect to
binary files:
1. The default mode is to attempt to remove binary files from a search
completely. This is meant to mirror how ripgrep removes hidden files and
files in your `.gitignore` automatically. That is, as soon as a file is
detected as binary, searching stops. If a match was already printed (because
it was detected long before a `NUL` byte), then ripgrep will print a warning
message indicating that the search stopped prematurely. This default mode
**only applies to files searched by ripgrep as a result of recursive
directory traversal**, which is consistent with ripgrep's other automatic
filtering. For example, `rg foo .file` will search `.file` even though it
is hidden. Similarly, `rg foo binary-file` search `binary-file` in "binary"
mode automatically.
2. Binary mode is similar to the default mode, except it will not always
stop searching after it sees a `NUL` byte. Namely, in this mode, ripgrep
will continue searching a file that is known to be binary until the first
of two conditions is met: 1) the end of the file has been reached or 2) a
match is or has been seen. This means that in binary mode, if ripgrep
reports no matches, then there are no matches in the file. When a match does
occur, ripgrep prints a message similar to one it prints when in its default
mode indicating that the search has stopped prematurely. This mode can be
forcefully enabled for all files with the `--binary` flag. The purpose of
binary mode is to provide a way to discover matches in all files, but to
avoid having binary data dumped into your terminal.
3. Text mode completely disables all binary detection and searches all files
as if they were text. This is useful when searching a file that is
predominantly text but contains a `NUL` byte, or if you are specifically
trying to search binary data. This mode can be enabled with the `-a/--text`
flag. Note that when using this mode on very large binary files, it is
possible for ripgrep to use a lot of memory.
Unfortunately, there is one additional complexity in ripgrep that can make it
difficult to reason about binary files. That is, the way binary detection works
depends on the way that ripgrep searches your files. Specifically:
* When ripgrep uses memory maps, then binary detection is only performed on the
first few kilobytes of the file in addition to every matching line.
* When ripgrep doesn't use memory maps, then binary detection is performed on
all bytes searched.
This means that whether a file is detected as binary or not can change based
on the internal search strategy used by ripgrep. If you prefer to keep
ripgrep's binary file detection consistent, then you can disable memory maps
via the `--no-mmap` flag. (The cost will be a small performance regression when
searching very large files on some platforms.)
### Common options
ripgrep has a lot of flags. Too many to keep in your head at once. This section
@@ -675,10 +773,10 @@ used options that will likely impact how you use ripgrep on a regular basis.
* `--files`: Print the files that ripgrep *would* search, but don't actually
search them.
* `-a/--text`: Search binary files as if they were plain text.
* `-z/--search-zip`: Search compressed files (gzip, bzip2, lzma, xz). This is
disabled by default.
* `-z/--search-zip`: Search compressed files (gzip, bzip2, lzma, xz, lz4,
brotli, zstd). This is disabled by default.
* `-C/--context`: Show the lines surrounding a match.
* `--sort-files`: Force ripgrep to sort its output by file name. (This disables
* `--sort path`: Force ripgrep to sort its output by file name. (This disables
parallelism, so it might be slower.)
* `-L/--follow`: Follow symbolic links while recursively searching.
* `-M/--max-columns`: Limit the length of lines printed by ripgrep.

114
README.md
View File

@@ -1,15 +1,17 @@
ripgrep (rg)
------------
ripgrep is a line-oriented search tool that recursively searches your current
directory for a regex pattern while respecting your gitignore rules. ripgrep
directory for a regex pattern. By default, ripgrep will respect your .gitignore
and automatically skip hidden files/directories and binary files. ripgrep
has first class support on Windows, macOS and Linux, with binary downloads
available for [every release](https://github.com/BurntSushi/ripgrep/releases).
ripgrep is similar to other popular search tools like The Silver Searcher,
ack and grep.
ripgrep is similar to other popular search tools like The Silver Searcher, ack
and grep.
[![Linux build status](https://travis-ci.org/BurntSushi/ripgrep.svg)](https://travis-ci.org/BurntSushi/ripgrep)
[![Windows build status](https://ci.appveyor.com/api/projects/status/github/BurntSushi/ripgrep?svg=true)](https://ci.appveyor.com/project/BurntSushi/ripgrep)
[![Crates.io](https://img.shields.io/crates/v/ripgrep.svg)](https://crates.io/crates/ripgrep)
[![Packaging status](https://repology.org/badge/tiny-repos/ripgrep.svg)](https://repology.org/project/ripgrep/badges)
Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org).
@@ -23,7 +25,7 @@ Please see the [CHANGELOG](CHANGELOG.md) for a release history.
* [Installation](#installation)
* [User Guide](GUIDE.md)
* [Frequently Asked Questions](FAQ.md)
* [Regex syntax](https://docs.rs/regex/0.2.5/regex/#syntax)
* [Regex syntax](https://docs.rs/regex/1/regex/#syntax)
* [Configuration files](GUIDE.md#configuration-file)
* [Shell completions](FAQ.md#complete)
* [Building](#building)
@@ -103,18 +105,23 @@ increases the times to `2.640s` for ripgrep and `10.277s` for GNU grep.
of search results, searching multiple patterns, highlighting matches with
color and full Unicode support. Unlike GNU grep, ripgrep stays fast while
supporting Unicode (which is always on).
* ripgrep has optional support for switching its regex engine to use PCRE2.
Among other things, this makes it possible to use look-around and
backreferences in your patterns, which are not supported in ripgrep's default
regex engine. PCRE2 support can be enabled with `-P/--pcre2` (use PCRE2
always) or `--auto-hybrid-regex` (use PCRE2 only if needed).
* ripgrep supports searching files in text encodings other than UTF-8, such
as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for
automatically detecting UTF-16 is provided. Other text encodings must be
specifically specified with the `-E/--encoding` flag.)
* ripgrep supports searching files compressed in a common format (gzip, xz,
lzma, bzip2 or lz4) with the `-z/--search-zip` flag.
* ripgrep supports searching files compressed in a common format (brotli,
bzip2, gzip, lz4, lzma, xz, or zstandard) with the `-z/--search-zip` flag.
* ripgrep supports arbitrary input preprocessing filters which could be PDF
text extraction, less supported decompression, decrypting, automatic encoding
detection and so on.
In other words, use ripgrep if you like speed, filtering by default, fewer
bugs, and Unicode support.
bugs and Unicode support.
### Why shouldn't I use ripgrep?
@@ -131,8 +138,8 @@ or more of the following:
* You need a portable and ubiquitous tool. While ripgrep works on Windows,
macOS and Linux, it is not ubiquitous and it does not conform to any
standard such as POSIX. The best tool for this job is good old grep.
* There still exists some other minor feature (or bug) found in another tool
that isn't in ripgrep.
* There still exists some other feature (or bug) not listed in this README that
you rely on that's in another tool that isn't in ripgrep.
* There is a performance edge case where ripgrep doesn't do well where another
tool does do well. (Please file a bug report!)
* ripgrep isn't possible to install on your machine or isn't available for your
@@ -159,7 +166,7 @@ Summarizing, ripgrep is fast because:
latter is better for large directories. ripgrep chooses the best searching
strategy for you automatically.
* Applies your ignore patterns in `.gitignore` files using a
[`RegexSet`](https://docs.rs/regex/1.0.0/regex/struct.RegexSet.html).
[`RegexSet`](https://docs.rs/regex/1/regex/struct.RegexSet.html).
That means a single file path can be matched against multiple glob patterns
simultaneously.
* It uses a lock-free parallel recursive directory iterator, courtesy of
@@ -202,14 +209,6 @@ from homebrew-core, (compiled with rust stable, no SIMD):
$ brew install ripgrep
```
or you can install a binary compiled with rust nightly (including SIMD and all
optimizations) by utilizing a custom tap:
```
$ brew tap burntsushi/ripgrep https://github.com/BurntSushi/ripgrep.git
$ brew install ripgrep-bin
```
If you're a **MacPorts** user, then you can install ripgrep from the
[official ports](https://www.macports.org/ports.php?by=name&substr=ripgrep):
@@ -244,21 +243,22 @@ If you're a **Gentoo** user, you can install ripgrep from the
$ emerge sys-apps/ripgrep
```
If you're a **Fedora 27+** user, you can install ripgrep from official
If you're a **Fedora** user, you can install ripgrep from official
repositories.
```
$ sudo dnf install ripgrep
```
If you're a **Fedora 24+** user, you can install ripgrep from
[copr](https://copr.fedorainfracloud.org/coprs/carlwgeorge/ripgrep/):
If you're an **openSUSE Leap 15.0** user, you can install ripgrep from the
[utilities repo](https://build.opensuse.org/package/show/utilities/ripgrep):
```
$ sudo dnf copr enable carlwgeorge/ripgrep
$ sudo dnf install ripgrep
$ sudo zypper ar https://download.opensuse.org/repositories/utilities/openSUSE_Leap_15.0/utilities.repo
$ sudo zypper install ripgrep
```
If you're an **openSUSE Tumbleweed** user, you can install ripgrep from the
[official repo](http://software.opensuse.org/package/ripgrep):
@@ -284,20 +284,27 @@ $ # (Or using the attribute name, which is also ripgrep.)
If you're a **Debian** user (or a user of a Debian derivative like **Ubuntu**),
then ripgrep can be installed using a binary `.deb` file provided in each
[ripgrep release](https://github.com/BurntSushi/ripgrep/releases). Note that
ripgrep is not in the official Debian or Ubuntu repositories.
[ripgrep release](https://github.com/BurntSushi/ripgrep/releases).
```
$ curl -LO https://github.com/BurntSushi/ripgrep/releases/download/0.9.0/ripgrep_0.9.0_amd64.deb
$ sudo dpkg -i ripgrep_0.9.0_amd64.deb
$ curl -LO https://github.com/BurntSushi/ripgrep/releases/download/11.0.1/ripgrep_11.0.1_amd64.deb
$ sudo dpkg -i ripgrep_11.0.1_amd64.deb
```
If you run Debian Buster (currently Debian testing) or Debian sid, ripgrep is
If you run Debian Buster (currently Debian testing) or Debian sid, ripgrep is
[officially maintained by Debian](https://tracker.debian.org/pkg/rust-ripgrep).
```
$ sudo apt-get install ripgrep
```
If you're an **Ubuntu Cosmic (18.10)** (or newer) user, ripgrep is
[available](https://launchpad.net/ubuntu/+source/rust-ripgrep) using the same
packaging as Debian:
```
$ sudo apt-get install ripgrep
```
(N.B. Various snaps for ripgrep on Ubuntu are also available, but none of them
seem to work right and generate a number of very strange bug reports that I
don't know how to fix and don't have the time to fix. Therefore, it is no
@@ -326,7 +333,7 @@ If you're a **NetBSD** user, then you can install ripgrep from
If you're a **Rust programmer**, ripgrep can be installed with `cargo`.
* Note that the minimum supported version of Rust for ripgrep is **1.23.0**,
* Note that the minimum supported version of Rust for ripgrep is **1.34.0**,
although ripgrep may work with older versions.
* Note that the binary may be bigger than expected because it contains debug
symbols. This is intentional. To remove debug symbols and therefore reduce
@@ -336,18 +343,15 @@ If you're a **Rust programmer**, ripgrep can be installed with `cargo`.
$ cargo install ripgrep
```
When compiling with Rust 1.27 or newer, this will automatically enable SIMD
optimizations for search.
ripgrep isn't currently in any other package repositories.
[I'd like to change that](https://github.com/BurntSushi/ripgrep/issues/10).
### Building
ripgrep is written in Rust, so you'll need to grab a
[Rust installation](https://www.rust-lang.org/) in order to compile it.
ripgrep compiles with Rust 1.23.0 (stable) or newer. Building is easy:
ripgrep compiles with Rust 1.34.0 (stable) or newer. In general, ripgrep tracks
the latest stable release of the Rust compiler.
To build ripgrep:
```
$ git clone https://github.com/BurntSushi/ripgrep
@@ -361,18 +365,14 @@ If you have a Rust nightly compiler and a recent Intel CPU, then you can enable
additional optional SIMD acceleration like so:
```
RUSTFLAGS="-C target-cpu=native" cargo build --release --features 'simd-accel avx-accel'
RUSTFLAGS="-C target-cpu=native" cargo build --release --features 'simd-accel'
```
If your machine doesn't support AVX instructions, then simply remove
`avx-accel` from the features list. Similarly for SIMD (which corresponds
roughly to SSE instructions).
The `simd-accel` and `avx-accel` features enable SIMD support in certain
ripgrep dependencies (responsible for counting lines and transcoding). They
are not necessary to get SIMD optimizations for search; those are enabled
automatically. Hopefully, some day, the `simd-accel` and `avx-accel` features
will similarly become unnecessary.
The `simd-accel` feature enables SIMD support in certain ripgrep dependencies
(responsible for transcoding). They are not necessary to get SIMD optimizations
for search; those are enabled automatically. Hopefully, some day, the
`simd-accel` feature will similarly become unnecessary. **WARNING:** Currently,
enabling this option can increase compilation times dramatically.
Finally, optional PCRE2 support can be built with ripgrep by enabling the
`pcre2` feature:
@@ -381,15 +381,16 @@ Finally, optional PCRE2 support can be built with ripgrep by enabling the
$ cargo build --release --features 'pcre2'
```
(Tip: use `--features 'pcre2 simd-accel avx-accel'` to also include compile
time SIMD optimizations.)
(Tip: use `--features 'pcre2 simd-accel'` to also include compile time SIMD
optimizations, which will only work with a nightly compiler.)
Enabling the PCRE2 feature will attempt to automatically find and link with
your system's PCRE2 library via `pkg-config`. If one doesn't exist, then
ripgrep will build PCRE2 from source using your system's C compiler and then
statically link it into the final executable. Static linking can be forced even
when there is an available PCRE2 system library by either building ripgrep with
the MUSL target or by setting `PCRE2_SYS_STATIC=1`.
Enabling the PCRE2 feature works with a stable Rust compiler and will
attempt to automatically find and link with your system's PCRE2 library via
`pkg-config`. If one doesn't exist, then ripgrep will build PCRE2 from source
using your system's C compiler and then statically link it into the final
executable. Static linking can be forced even when there is an available PCRE2
system library by either building ripgrep with the MUSL target or by setting
`PCRE2_SYS_STATIC=1`.
ripgrep can be built with the MUSL target on Linux by first installing the MUSL
library on your system (consult your friendly neighborhood package manager).
@@ -401,7 +402,10 @@ $ rustup target add x86_64-unknown-linux-musl
$ cargo build --release --target x86_64-unknown-linux-musl
```
Applying the `--features` flag from above works as expected.
Applying the `--features` flag from above works as expected. If you want to
build a static executable with MUSL and with PCRE2, then you will need to have
`musl-gcc` installed, which might be in a separate package from the actual
MUSL library, depending on your Linux distribution.
### Running tests

View File

@@ -45,11 +45,10 @@ install:
- rustc -V
- cargo -V
# ???
# Hack to work around a harmless warning in Appveyor builds?
build: false
# Equivalent to Travis' `script` phase
# TODO modify this phase as you see fit
test_script:
- cargo test --verbose --all --features pcre2
@@ -60,7 +59,7 @@ before_deploy:
- copy target\release\rg.exe staging
- ps: copy target\release\build\ripgrep-*\out\_rg.ps1 staging
- cd staging
# release zipfile will look like 'rust-everywhere-v1.2.3-x86_64-pc-windows-msvc'
# release zipfile will look like 'ripgrep-1.2.3-x86_64-pc-windows-msvc'
- 7z a ../%PROJECT_NAME%-%APPVEYOR_REPO_TAG_NAME%-%TARGET%.zip *
- appveyor PushArtifact ../%PROJECT_NAME%-%APPVEYOR_REPO_TAG_NAME%-%TARGET%.zip
@@ -73,17 +72,10 @@ deploy:
provider: GitHub
# deploy when a new tag is pushed and only on the stable channel
on:
# channel to use to produce the release artifacts
# NOTE make sure you only release *once* per target
# TODO you may want to pick a different channel
CHANNEL: stable
appveyor_repo_tag: true
branches:
only:
- /\d+\.\d+\.\d+/
- /^\d+\.\d+\.\d+$/
- master
# - appveyor
# - /\d+\.\d+\.\d+/
# except:
# - master

View File

@@ -1,10 +1,4 @@
#[macro_use]
extern crate clap;
#[macro_use]
extern crate lazy_static;
use std::env;
use std::ffi::OsString;
use std::fs::{self, File};
use std::io::{self, Read, Write};
use std::path::Path;
@@ -19,22 +13,6 @@ use app::{RGArg, RGArgKind};
mod app;
fn main() {
// If our version of Rust has runtime SIMD detection, then set a cfg so
// we know we can test for it. We use this when generating ripgrep's
// --version output.
let version = rustc_version();
let parsed = match Version::parse(&version) {
Ok(parsed) => parsed,
Err(err) => {
eprintln!("failed to parse `rustc --version`: {}", err);
return;
}
};
let minimum = Version { major: 1, minor: 27, patch: 0 };
if version.contains("nightly") || parsed >= minimum {
println!("cargo:rustc-cfg=ripgrep_runtime_cpu");
}
// OUT_DIR is set by Cargo and it's where any additional build artifacts
// are written.
let outdir = match env::var_os("OUT_DIR") {
@@ -185,7 +163,12 @@ fn formatted_arg(arg: &RGArg) -> io::Result<String> {
}
fn formatted_doc_txt(arg: &RGArg) -> io::Result<String> {
let paragraphs: Vec<&str> = arg.doc_long.split("\n\n").collect();
let paragraphs: Vec<String> = arg.doc_long
.replace("{", "&#123;")
.replace("}", r"&#125;")
.split("\n\n")
.map(|s| s.to_string())
.collect();
if paragraphs.is_empty() {
return Err(ioerr(format!("missing docs for --{}", arg.name)));
}
@@ -199,63 +182,3 @@ fn formatted_doc_txt(arg: &RGArg) -> io::Result<String> {
fn ioerr(msg: String) -> io::Error {
io::Error::new(io::ErrorKind::Other, msg)
}
fn rustc_version() -> String {
let rustc = env::var_os("RUSTC").unwrap_or(OsString::from("rustc"));
let output = process::Command::new(&rustc)
.arg("--version")
.output()
.unwrap()
.stdout;
String::from_utf8(output).unwrap()
}
#[derive(Clone, Copy, Debug, Eq, PartialEq, PartialOrd, Ord)]
struct Version {
major: u32,
minor: u32,
patch: u32,
}
impl Version {
fn parse(mut s: &str) -> Result<Version, String> {
if !s.starts_with("rustc ") {
return Err(format!("unrecognized version string: {}", s));
}
s = &s["rustc ".len()..];
let parts: Vec<&str> = s.split(".").collect();
if parts.len() < 3 {
return Err(format!("not enough version parts: {:?}", parts));
}
let mut num = String::new();
for c in parts[0].chars() {
if !c.is_digit(10) {
break;
}
num.push(c);
}
let major = num.parse::<u32>().map_err(|e| e.to_string())?;
num.clear();
for c in parts[1].chars() {
if !c.is_digit(10) {
break;
}
num.push(c);
}
let minor = num.parse::<u32>().map_err(|e| e.to_string())?;
num.clear();
for c in parts[2].chars() {
if !c.is_digit(10) {
break;
}
num.push(c);
}
let patch = num.parse::<u32>().map_err(|e| e.to_string())?;
Ok(Version { major, minor, patch })
}
}

View File

@@ -8,10 +8,13 @@ set -ex
# Generate artifacts for release
mk_artifacts() {
CARGO="$(builder)"
if is_arm; then
cargo build --target "$TARGET" --release
"$CARGO" build --target "$TARGET" --release
else
cargo build --target "$TARGET" --release --features 'pcre2'
# Technically, MUSL builds will force PCRE2 to get statically compiled,
# but we also want PCRE2 statically build for macOS binaries.
PCRE2_SYS_STATIC=1 "$CARGO" build --target "$TARGET" --release --features 'pcre2'
fi
}

43
ci/build_deb.sh Executable file
View File

@@ -0,0 +1,43 @@
#!/bin/bash
set -e
# This script builds a binary dpkg for Debian based distros. It does not
# currently run in CI, and is instead run manually and the resulting dpkg is
# uploaded to GitHub via the web UI.
#
# Note that this requires 'cargo deb', which can be installed with
# 'cargo install cargo-deb'.
#
# This should be run from the root of the ripgrep repo.
if ! command -V cargo-deb > /dev/null 2>&1; then
echo "cargo-deb command missing" >&2
exit 1
fi
# 'cargo deb' does not seem to provide a way to specify an asset that is
# created at build time, such as ripgrep's man page. To work around this,
# we force a debug build, copy out the man page (and shell completions)
# produced from that build, put it into a predictable location and then build
# the deb, which knows where to look.
DEPLOY_DIR=deployment/deb
mkdir -p "$DEPLOY_DIR"
cargo build
# Find and copy man page.
manpage="$(find ./target/debug -name rg.1 -print0 | xargs -0 ls -t | head -n1)"
cp "$manpage" "$DEPLOY_DIR/"
# Do the same for shell completions.
compbash="$(find ./target/debug -name rg.bash -print0 | xargs -0 ls -t | head -n1)"
cp "$compbash" "$DEPLOY_DIR/"
compfish="$(find ./target/debug -name rg.fish -print0 | xargs -0 ls -t | head -n1)"
cp "$compfish" "$DEPLOY_DIR/"
compzsh="complete/_rg"
cp "$compzsh" "$DEPLOY_DIR/"
# Since we're distributing the dpkg, we don't know whether the user will have
# PCRE2 installed, so just do a static build.
PCRE2_SYS_STATIC=1 cargo deb

View File

@@ -7,11 +7,13 @@ set -ex
. "$(dirname $0)/utils.sh"
main() {
CARGO="$(builder)"
# Test a normal debug build.
if is_arm; then
cargo build --target "$TARGET" --verbose
"$CARGO" build --target "$TARGET" --verbose
else
cargo build --target "$TARGET" --verbose --all --features 'pcre2'
"$CARGO" build --target "$TARGET" --verbose --all --features 'pcre2'
fi
# Show the output of the most recent build.rs stderr.
@@ -44,7 +46,7 @@ main() {
"$(dirname "${0}")/test_complete.sh"
# Run tests for ripgrep and all sub-crates.
cargo test --target "$TARGET" --verbose --all --features 'pcre2'
"$CARGO" test --target "$TARGET" --verbose --all --features 'pcre2'
}
main

View File

@@ -55,6 +55,13 @@ gcc_prefix() {
esac
}
is_musl() {
case "$TARGET" in
*-musl) return 0 ;;
*) return 1 ;;
esac
}
is_x86() {
case "$(architecture)" in
amd64|i386) return 0 ;;
@@ -62,6 +69,13 @@ is_x86() {
esac
}
is_x86_64() {
case "$(architecture)" in
amd64) return 0 ;;
*) return 1 ;;
esac
}
is_arm() {
case "$(architecture)" in
armhf) return 0 ;;
@@ -82,3 +96,12 @@ is_osx() {
*) return 1 ;;
esac
}
builder() {
if is_musl && is_x86_64; then
cargo install cross
echo "cross"
else
echo "cargo"
fi
}

View File

@@ -43,6 +43,13 @@ _rg() {
+ '(exclusive)' # Misc. fully exclusive options
'(: * -)'{-h,--help}'[display help information]'
'(: * -)'{-V,--version}'[display version information]'
'(: * -)'--pcre2-version'[print the version of PCRE2 used by ripgrep, if available]'
+ '(buffered)' # buffering options
'--line-buffered[force line buffering]'
$no"--no-line-buffered[don't force line buffering]"
'--block-buffered[force block buffering]'
$no"--no-block-buffered[don't force block buffering]"
+ '(case)' # Case-sensitivity options
{-i,--ignore-case}'[search case-insensitively]'
@@ -71,7 +78,7 @@ _rg() {
$no'--no-encoding[use default text encoding]'
+ file # File-input options
'*'{-f+,--file=}'[specify file containing patterns to search for]: :_files'
'(1)*'{-f+,--file=}'[specify file containing patterns to search for]: :_files'
+ '(file-match)' # Files with/without match options
'(stats)'{-l,--files-with-matches}'[only show names of files with matches]'
@@ -79,7 +86,11 @@ _rg() {
+ '(file-name)' # File-name options
{-H,--with-filename}'[show file name for matches]'
"--no-filename[don't show file name for matches]"
{-I,--no-filename}"[don't show file name for matches]"
+ '(file-system)' # File system options
"--one-file-system[don't descend into directories on other file systems]"
$no'--no-one-file-system[descend into directories on other file systems]'
+ '(fixed)' # Fixed-string options
{-F,--fixed-strings}'[treat pattern as literal string instead of regular expression]'
@@ -101,9 +112,17 @@ _rg() {
'--hidden[search hidden files and directories]'
$no"--no-hidden[don't search hidden files and directories]"
+ '(hybrid)' # hybrid regex options
'--auto-hybrid-regex[dynamically use PCRE2 if necessary]'
$no"--no-auto-hybrid-regex[don't dynamically use PCRE2 if necessary]"
+ '(ignore)' # Ignore-file options
"(--no-ignore-global --no-ignore-parent --no-ignore-vcs)--no-ignore[don't respect ignore files]"
$no'(--ignore-global --ignore-parent --ignore-vcs)--ignore[respect ignore files]'
"(--no-ignore-global --no-ignore-parent --no-ignore-vcs --no-ignore-dot)--no-ignore[don't respect ignore files]"
$no'(--ignore-global --ignore-parent --ignore-vcs --ignore-dot)--ignore[respect ignore files]'
+ '(ignore-file-case-insensitive)' # Ignore-file case sensitivity options
'--ignore-file-case-insensitive[process ignore files case insensitively]'
$no'--no-ignore-file-case-insensitive[process ignore files case sensitively]'
+ '(ignore-global)' # Global ignore-file options
"--no-ignore-global[don't respect global ignore files]"
@@ -117,6 +136,10 @@ _rg() {
"--no-ignore-vcs[don't respect version control ignore files]"
$no'--ignore-vcs[respect version control ignore files]'
+ '(ignore-dot)' # .ignore-file options
"--no-ignore-dot[don't respect .ignore files]"
$no'--ignore-dot[respect .ignore files]'
+ '(json)' # JSON options
'--json[output results in JSON Lines format]'
$no"--no-json[don't output results in JSON Lines format]"
@@ -130,6 +153,10 @@ _rg() {
$no"--no-crlf[don't use CRLF as line terminator]"
'(text)--null-data[use NUL as line terminator]'
+ '(max-columns-preview)' # max column preview options
'--max-columns-preview[show preview for long lines (with -M)]'
$no"--no-max-columns-preview[don't show preview for long lines (with -M)]"
+ '(max-depth)' # Directory-depth options
'--max-depth=[specify max number of directories to descend]:number of directories'
'!--maxdepth=:number of directories'
@@ -166,13 +193,16 @@ _rg() {
$no'(pcre2-unicode)--no-pcre2[disable matching with PCRE2]'
+ '(pcre2-unicode)' # PCRE2 Unicode options
$no'(--no-pcre2-unicode)--pcre2-unicode[enable PCRE2 Unicode mode (with -P)]'
'(--no-pcre2-unicode)--no-pcre2-unicode[disable PCRE2 Unicode mode (with -P)]'
$no'(--no-pcre2 --no-pcre2-unicode)--pcre2-unicode[enable PCRE2 Unicode mode (with -P)]'
'(--no-pcre2 --pcre2-unicode)--no-pcre2-unicode[disable PCRE2 Unicode mode (with -P)]'
+ '(pre)' # Preprocessing options
'(-z --search-zip)--pre=[specify preprocessor utility]:preprocessor utility:_command_names -e'
$no'--no-pre[disable preprocessor utility]'
+ pre-glob # Preprocessing glob options
'*--pre-glob[include/exclude files for preprocessing with --pre]'
+ '(pretty-vimgrep)' # Pretty/vimgrep display options
'(heading)'{-p,--pretty}'[alias for --color=always --heading -n]'
'(heading passthru)--vimgrep[show results in vim-compatible format]'
@@ -184,8 +214,21 @@ _rg() {
{-r+,--replace=}'[specify string used to replace matches]:replace string'
+ '(sort)' # File-sorting options
'(threads)--sort-files[sort results by file path (disables parallelism)]'
$no"--no-sort-files[don't sort results by file path]"
'(threads)--sort=[sort results in ascending order (disables parallelism)]:sort method:((
none\:"no sorting"
path\:"sort by file path"
modified\:"sort by last modified time"
accessed\:"sort by last accessed time"
created\:"sort by creation time"
))'
'(threads)--sortr=[sort results in descending order (disables parallelism)]:sort method:((
none\:"no sorting"
path\:"sort by file path"
modified\:"sort by last modified time"
accessed\:"sort by last accessed time"
created\:"sort by creation time"
))'
'!(threads)--sort-files[sort results by file path (disables parallelism)]'
+ '(stats)' # Statistics options
'(--files file-match)--stats[show search statistics]'
@@ -193,10 +236,12 @@ _rg() {
+ '(text)' # Binary-search options
{-a,--text}'[search binary files as if they were text]'
"--binary[search binary files, don't print binary data]"
$no"--no-binary[don't search binary files]"
$no"(--null-data)--no-text[don't search binary files as if they were text]"
+ '(threads)' # Thread-count options
'(--sort-files)'{-j+,--threads=}'[specify approximate number of threads to use]:number of threads'
'(sort)'{-j+,--threads=}'[specify approximate number of threads to use]:number of threads'
+ '(trim)' # Trim options
'--trim[trim any ASCII whitespace prefix from each line]'
@@ -344,7 +389,7 @@ _rg_encodings() {
shift{-,_}jis csshiftjis {,x-}sjis ms_kanji ms932
utf{,-}8 utf-16{,be,le} unicode-1-1-utf-8
windows-{31j,874,949,125{0..8}} dos-874 tis-620 ansi_x3.4-1968
x-user-defined auto
x-user-defined auto none
)
_wanted encodings expl encoding compadd -a "$@" - _encodings

View File

@@ -28,27 +28,40 @@ Synopsis
DESCRIPTION
-----------
ripgrep (rg) recursively searches your current directory for a regex pattern.
By default, ripgrep will respect your `.gitignore` and automatically skip
hidden files/directories and binary files.
By default, ripgrep will respect your .gitignore and automatically skip hidden
files/directories and binary files.
ripgrep's regex engine uses finite automata and guarantees linear time
searching. Because of this, features like backreferences and arbitrary
lookaround are not supported.
ripgrep's default regex engine uses finite automata and guarantees linear
time searching. Because of this, features like backreferences and arbitrary
look-around are not supported. However, if ripgrep is built with PCRE2, then
the *--pcre2* flag can be used to enable backreferences and look-around.
ripgrep supports configuration files. Set *RIPGREP_CONFIG_PATH* to a
configuration file. The file can specify one shell argument per line. Lines
starting with *#* are ignored. For more details, see the man page or the
*README*.
Tip: to disable all smart filtering and make ripgrep behave a bit more like
classical grep, use *rg -uuu*.
REGEX SYNTAX
------------
ripgrep uses Rust's regex engine, which documents its syntax:
https://docs.rs/regex/0.2.5/regex/#syntax
ripgrep uses Rust's regex engine by default, which documents its syntax:
https://docs.rs/regex/*/regex/#syntax
ripgrep uses byte-oriented regexes, which has some additional documentation:
https://docs.rs/regex/0.2.5/regex/bytes/index.html#syntax
https://docs.rs/regex/*/regex/bytes/index.html#syntax
To a first approximation, ripgrep uses Perl-like regexes without look-around or
backreferences. This makes them very similar to the "extended" (ERE) regular
expressions supported by `egrep`, but with a few additional features like
expressions supported by *egrep*, but with a few additional features like
Unicode character classes.
If you're using ripgrep with the *--pcre2* flag, then please consult
https://www.pcre.org or the PCRE2 man pages for documentation on the supported
syntax.
POSITIONAL ARGUMENTS
--------------------
@@ -58,18 +71,37 @@ _PATTERN_::
_PATH_::
A file or directory to search. Directories are searched recursively. Paths
specified expicitly on the command line override glob and ignore rules.
specified explicitly on the command line override glob and ignore rules.
OPTIONS
-------
Note that for many options, there exist flags to disable them. In some cases,
those flags are not listed in a first class way below. For example, the
*--column* flag (listed below) enables column numbers in ripgrep's output, but
the *--no-column* flag (not listed below) disables them. The reverse can also
exist. For example, the *--no-ignore* flag (listed below) disables ripgrep's
*gitignore* logic, but the *--ignore* flag (not listed below) enables it. These
flags are useful for overriding a ripgrep configuration file on the command
line. Each flag's documentation notes whether an inverted flag exists. In all
cases, the flag specified last takes precedence.
{OPTIONS}
EXIT STATUS
-----------
If ripgrep finds a match, then the exit status of the program is 0. If no match
could be found, then the exit status is non-zero.
could be found, then the exit status is 1. If an error occurred, then the exit
status is always 2 unless ripgrep was run with the *--quiet* flag and a match
was found. In summary:
* `0` exit status occurs only when at least one match was found, and if
no error occurred, unless *--quiet* was given.
* `1` exit status occurs only when no match was found and no error occurred.
* `2` exit status occurs when an error occurred. This is true for both
catastrophic errors (e.g., a regex syntax error) and for soft errors (e.g.,
unable to read a file).
CONFIGURATION FILES
@@ -78,12 +110,12 @@ ripgrep supports reading configuration files that change ripgrep's default
behavior. The format of the configuration file is an "rc" style and is very
simple. It is defined by two rules:
1. Every line is a shell argument, after trimming ASCII whitespace.
2. Lines starting with _#_ (optionally preceded by any amount of
ASCII whitespace) are ignored.
1. Every line is a shell argument, after trimming whitespace.
2. Lines starting with *#* (optionally preceded by any amount of
whitespace) are ignored.
ripgrep will look for a single configuration file if and only if the
_RIPGREP_CONFIG_PATH_ environment variable is set and is non-empty.
*RIPGREP_CONFIG_PATH* environment variable is set and is non-empty.
ripgrep will parse shell arguments from this file on startup and will
behave as if the arguments in this file were prepended to any explicit
arguments given to ripgrep on the command line.
@@ -145,20 +177,35 @@ SHELL COMPLETION
Shell completion files are included in the release tarball for Bash, Fish, Zsh
and PowerShell.
For *bash*, move `rg.bash` to `$XDG_CONFIG_HOME/bash_completion`
or `/etc/bash_completion.d/`.
For *bash*, move *rg.bash* to *$XDG_CONFIG_HOME/bash_completion*
or */etc/bash_completion.d/*.
For *fish*, move `rg.fish` to `$HOME/.config/fish/completions`.
For *fish*, move *rg.fish* to *$HOME/.config/fish/completions*.
For *zsh*, move `_rg` to one of your `$fpath` directories.
For *zsh*, move *_rg* to one of your *$fpath* directories.
CAVEATS
-------
ripgrep may abort unexpectedly when using default settings if it searches a
file that is simultaneously truncated. This behavior can be avoided by passing
the --no-mmap flag which will forcefully disable the use of memory maps in all
cases.
the *--no-mmap* flag which will forcefully disable the use of memory maps in
all cases.
ripgrep may use a large amount of memory depending on a few factors. Firstly,
if ripgrep uses parallelism for search (the default), then the entire output
for each individual file is buffered into memory in order to prevent
interleaving matches in the output. To avoid this, you can disable parallelism
with the *-j1* flag. Secondly, ripgrep always needs to have at least a single
line in memory in order to execute a search. A file with a very long line can
thus cause ripgrep to use a lot of memory. Generally, this only occurs when
searching binary data with the *-a* flag enabled. (When the *-a* flag isn't
enabled, ripgrep will replace all NUL bytes with line terminators, which
typically prevents exorbitant memory usage.) Thirdly, when ripgrep searches
a large file using a memory map, the process will report its resident memory
usage as the size of the file. However, this does not mean ripgrep actually
needed to use that much memory; the operating system will generally handle this
for you.
VERSION
@@ -170,7 +217,11 @@ HOMEPAGE
--------
https://github.com/BurntSushi/ripgrep
Please report bugs and feature requests in the issue tracker.
Please report bugs and feature requests in the issue tracker. Please do your
best to provide a reproducible test case for bugs. This should include the
corpus being searched, the *rg* command, the actual output and the expected
output. Please also include the output of running the same *rg* command but
with the *--debug* flag.
AUTHORS

View File

@@ -1,6 +1,6 @@
[package]
name = "globset"
version = "0.4.1" #:version
version = "0.4.4" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
Cross platform single glob and glob set matching. Glob set matching is the
@@ -19,14 +19,14 @@ name = "globset"
bench = false
[dependencies]
aho-corasick = "0.6.0"
fnv = "1.0"
log = "0.4"
memchr = "2"
regex = "1"
aho-corasick = "0.7.3"
bstr = { version = "0.2.0", default-features = false, features = ["std"] }
fnv = "1.0.6"
log = "0.4.5"
regex = "1.1.5"
[dev-dependencies]
glob = "0.2"
glob = "0.3.0"
[features]
simd-accel = []

View File

@@ -837,40 +837,66 @@ impl<'a> Parser<'a> {
fn parse_star(&mut self) -> Result<(), Error> {
let prev = self.prev;
if self.chars.peek() != Some(&'*') {
if self.peek() != Some('*') {
self.push_token(Token::ZeroOrMore)?;
return Ok(());
}
assert!(self.bump() == Some('*'));
if !self.have_tokens()? {
self.push_token(Token::RecursivePrefix)?;
let next = self.bump();
if !next.map(is_separator).unwrap_or(true) {
return Err(self.error(ErrorKind::InvalidRecursive));
if !self.peek().map_or(true, is_separator) {
self.push_token(Token::ZeroOrMore)?;
self.push_token(Token::ZeroOrMore)?;
} else {
self.push_token(Token::RecursivePrefix)?;
assert!(self.bump().map_or(true, is_separator));
}
return Ok(());
}
self.pop_token()?;
if !prev.map(is_separator).unwrap_or(false) {
if self.stack.len() <= 1
|| (prev != Some(',') && prev != Some('{')) {
return Err(self.error(ErrorKind::InvalidRecursive));
|| (prev != Some(',') && prev != Some('{'))
{
self.push_token(Token::ZeroOrMore)?;
self.push_token(Token::ZeroOrMore)?;
return Ok(());
}
}
match self.chars.peek() {
None => {
assert!(self.bump().is_none());
self.push_token(Token::RecursiveSuffix)
let is_suffix =
match self.peek() {
None => {
assert!(self.bump().is_none());
true
}
Some(',') | Some('}') if self.stack.len() >= 2 => {
true
}
Some(c) if is_separator(c) => {
assert!(self.bump().map(is_separator).unwrap_or(false));
false
}
_ => {
self.push_token(Token::ZeroOrMore)?;
self.push_token(Token::ZeroOrMore)?;
return Ok(());
}
};
match self.pop_token()? {
Token::RecursivePrefix => {
self.push_token(Token::RecursivePrefix)?;
}
Some(&',') | Some(&'}') if self.stack.len() >= 2 => {
self.push_token(Token::RecursiveSuffix)
Token::RecursiveSuffix => {
self.push_token(Token::RecursiveSuffix)?;
}
Some(&c) if is_separator(c) => {
assert!(self.bump().map(is_separator).unwrap_or(false));
self.push_token(Token::RecursiveZeroOrMore)
_ => {
if is_suffix {
self.push_token(Token::RecursiveSuffix)?;
} else {
self.push_token(Token::RecursiveZeroOrMore)?;
}
}
_ => Err(self.error(ErrorKind::InvalidRecursive)),
}
Ok(())
}
fn parse_class(&mut self) -> Result<(), Error> {
@@ -959,6 +985,10 @@ impl<'a> Parser<'a> {
self.cur = self.chars.next();
self.cur
}
fn peek(&mut self) -> Option<char> {
self.chars.peek().map(|&ch| ch)
}
}
#[cfg(test)]
@@ -1144,13 +1174,6 @@ mod tests {
syntax!(cls20, "[^a]", vec![classn('a', 'a')]);
syntax!(cls21, "[^a-z]", vec![classn('a', 'z')]);
syntaxerr!(err_rseq1, "a**", ErrorKind::InvalidRecursive);
syntaxerr!(err_rseq2, "**a", ErrorKind::InvalidRecursive);
syntaxerr!(err_rseq3, "a**b", ErrorKind::InvalidRecursive);
syntaxerr!(err_rseq4, "***", ErrorKind::InvalidRecursive);
syntaxerr!(err_rseq5, "/a**", ErrorKind::InvalidRecursive);
syntaxerr!(err_rseq6, "/**a", ErrorKind::InvalidRecursive);
syntaxerr!(err_rseq7, "/a**b", ErrorKind::InvalidRecursive);
syntaxerr!(err_unclosed1, "[", ErrorKind::UnclosedClass);
syntaxerr!(err_unclosed2, "[]", ErrorKind::UnclosedClass);
syntaxerr!(err_unclosed3, "[!", ErrorKind::UnclosedClass);
@@ -1194,8 +1217,30 @@ mod tests {
toregex!(re8, "[*]", r"^[\*]$");
toregex!(re9, "[+]", r"^[\+]$");
toregex!(re10, "+", r"^\+$");
toregex!(re11, "**", r"^.*$");
toregex!(re12, "", r"^\xe2\x98\x83$");
toregex!(re11, "", r"^\xe2\x98\x83$");
toregex!(re12, "**", r"^.*$");
toregex!(re13, "**/", r"^.*$");
toregex!(re14, "**/*", r"^(?:/?|.*/).*$");
toregex!(re15, "**/**", r"^.*$");
toregex!(re16, "**/**/*", r"^(?:/?|.*/).*$");
toregex!(re17, "**/**/**", r"^.*$");
toregex!(re18, "**/**/**/*", r"^(?:/?|.*/).*$");
toregex!(re19, "a/**", r"^a(?:/?|/.*)$");
toregex!(re20, "a/**/**", r"^a(?:/?|/.*)$");
toregex!(re21, "a/**/**/**", r"^a(?:/?|/.*)$");
toregex!(re22, "a/**/b", r"^a(?:/|/.*/)b$");
toregex!(re23, "a/**/**/b", r"^a(?:/|/.*/)b$");
toregex!(re24, "a/**/**/**/b", r"^a(?:/|/.*/)b$");
toregex!(re25, "**/b", r"^(?:/?|.*/)b$");
toregex!(re26, "**/**/b", r"^(?:/?|.*/)b$");
toregex!(re27, "**/**/**/b", r"^(?:/?|.*/)b$");
toregex!(re28, "a**", r"^a.*.*$");
toregex!(re29, "**a", r"^.*.*a$");
toregex!(re30, "a**b", r"^a.*.*b$");
toregex!(re31, "***", r"^.*.*.*$");
toregex!(re32, "/a**", r"^/a.*.*$");
toregex!(re33, "/**a", r"^/.*.*a$");
toregex!(re34, "/a**b", r"^/a.*.*b$");
matches!(match1, "a", "a");
matches!(match2, "a*b", "a_b");

View File

@@ -104,27 +104,25 @@ or to enable case insensitive matching.
#![deny(missing_docs)]
extern crate aho_corasick;
extern crate bstr;
extern crate fnv;
#[macro_use]
extern crate log;
extern crate memchr;
extern crate regex;
use std::borrow::Cow;
use std::collections::{BTreeMap, HashMap};
use std::error::Error as StdError;
use std::ffi::OsStr;
use std::fmt;
use std::hash;
use std::path::Path;
use std::str;
use aho_corasick::{Automaton, AcAutomaton, FullAcAutomaton};
use aho_corasick::AhoCorasick;
use bstr::{B, ByteSlice, ByteVec};
use regex::bytes::{Regex, RegexBuilder, RegexSet};
use pathutil::{
file_name, file_name_ext, normalize_path, os_str_bytes, path_bytes,
};
use pathutil::{file_name, file_name_ext, normalize_path};
use glob::MatchStrategy;
pub use glob::{Glob, GlobBuilder, GlobMatcher};
@@ -143,8 +141,13 @@ pub struct Error {
/// The kind of error that can occur when parsing a glob pattern.
#[derive(Clone, Debug, Eq, PartialEq)]
pub enum ErrorKind {
/// Occurs when a use of `**` is invalid. Namely, `**` can only appear
/// adjacent to a path separator, or the beginning/end of a glob.
/// **DEPRECATED**.
///
/// This error used to occur for consistency with git's glob specification,
/// but the specification now accepts all uses of `**`. When `**` does not
/// appear adjacent to a path separator or at the beginning/end of a glob,
/// it is now treated as two consecutive `*` patterns. As such, this error
/// is no longer used.
InvalidRecursive,
/// Occurs when a character class (e.g., `[abc]`) is not closed.
UnclosedClass,
@@ -289,6 +292,7 @@ pub struct GlobSet {
impl GlobSet {
/// Create an empty `GlobSet`. An empty set matches nothing.
#[inline]
pub fn empty() -> GlobSet {
GlobSet {
len: 0,
@@ -297,11 +301,13 @@ impl GlobSet {
}
/// Returns true if this set is empty, and therefore matches nothing.
#[inline]
pub fn is_empty(&self) -> bool {
self.len == 0
}
/// Returns the number of globs in this set.
#[inline]
pub fn len(&self) -> usize {
self.len
}
@@ -492,12 +498,13 @@ pub struct Candidate<'a> {
impl<'a> Candidate<'a> {
/// Create a new candidate for matching from the given path.
pub fn new<P: AsRef<Path> + ?Sized>(path: &'a P) -> Candidate<'a> {
let path = path.as_ref();
let basename = file_name(path).unwrap_or(OsStr::new(""));
let path = normalize_path(Vec::from_path_lossy(path.as_ref()));
let basename = file_name(&path).unwrap_or(Cow::Borrowed(B("")));
let ext = file_name_ext(&basename).unwrap_or(Cow::Borrowed(B("")));
Candidate {
path: normalize_path(path_bytes(path)),
basename: os_str_bytes(basename),
ext: file_name_ext(basename).unwrap_or(Cow::Borrowed(b"")),
path: path,
basename: basename,
ext: ext,
}
}
@@ -570,12 +577,12 @@ impl LiteralStrategy {
}
fn is_match(&self, candidate: &Candidate) -> bool {
self.0.contains_key(&*candidate.path)
self.0.contains_key(candidate.path.as_bytes())
}
#[inline(never)]
fn matches_into(&self, candidate: &Candidate, matches: &mut Vec<usize>) {
if let Some(hits) = self.0.get(&*candidate.path) {
if let Some(hits) = self.0.get(candidate.path.as_bytes()) {
matches.extend(hits);
}
}
@@ -597,7 +604,7 @@ impl BasenameLiteralStrategy {
if candidate.basename.is_empty() {
return false;
}
self.0.contains_key(&*candidate.basename)
self.0.contains_key(candidate.basename.as_bytes())
}
#[inline(never)]
@@ -605,7 +612,7 @@ impl BasenameLiteralStrategy {
if candidate.basename.is_empty() {
return;
}
if let Some(hits) = self.0.get(&*candidate.basename) {
if let Some(hits) = self.0.get(candidate.basename.as_bytes()) {
matches.extend(hits);
}
}
@@ -627,7 +634,7 @@ impl ExtensionStrategy {
if candidate.ext.is_empty() {
return false;
}
self.0.contains_key(&*candidate.ext)
self.0.contains_key(candidate.ext.as_bytes())
}
#[inline(never)]
@@ -635,7 +642,7 @@ impl ExtensionStrategy {
if candidate.ext.is_empty() {
return;
}
if let Some(hits) = self.0.get(&*candidate.ext) {
if let Some(hits) = self.0.get(candidate.ext.as_bytes()) {
matches.extend(hits);
}
}
@@ -643,7 +650,7 @@ impl ExtensionStrategy {
#[derive(Clone, Debug)]
struct PrefixStrategy {
matcher: FullAcAutomaton<Vec<u8>>,
matcher: AhoCorasick,
map: Vec<usize>,
longest: usize,
}
@@ -651,8 +658,8 @@ struct PrefixStrategy {
impl PrefixStrategy {
fn is_match(&self, candidate: &Candidate) -> bool {
let path = candidate.path_prefix(self.longest);
for m in self.matcher.find_overlapping(path) {
if m.start == 0 {
for m in self.matcher.find_overlapping_iter(path) {
if m.start() == 0 {
return true;
}
}
@@ -661,9 +668,9 @@ impl PrefixStrategy {
fn matches_into(&self, candidate: &Candidate, matches: &mut Vec<usize>) {
let path = candidate.path_prefix(self.longest);
for m in self.matcher.find_overlapping(path) {
if m.start == 0 {
matches.push(self.map[m.pati]);
for m in self.matcher.find_overlapping_iter(path) {
if m.start() == 0 {
matches.push(self.map[m.pattern()]);
}
}
}
@@ -671,7 +678,7 @@ impl PrefixStrategy {
#[derive(Clone, Debug)]
struct SuffixStrategy {
matcher: FullAcAutomaton<Vec<u8>>,
matcher: AhoCorasick,
map: Vec<usize>,
longest: usize,
}
@@ -679,8 +686,8 @@ struct SuffixStrategy {
impl SuffixStrategy {
fn is_match(&self, candidate: &Candidate) -> bool {
let path = candidate.path_suffix(self.longest);
for m in self.matcher.find_overlapping(path) {
if m.end == path.len() {
for m in self.matcher.find_overlapping_iter(path) {
if m.end() == path.len() {
return true;
}
}
@@ -689,9 +696,9 @@ impl SuffixStrategy {
fn matches_into(&self, candidate: &Candidate, matches: &mut Vec<usize>) {
let path = candidate.path_suffix(self.longest);
for m in self.matcher.find_overlapping(path) {
if m.end == path.len() {
matches.push(self.map[m.pati]);
for m in self.matcher.find_overlapping_iter(path) {
if m.end() == path.len() {
matches.push(self.map[m.pattern()]);
}
}
}
@@ -705,11 +712,11 @@ impl RequiredExtensionStrategy {
if candidate.ext.is_empty() {
return false;
}
match self.0.get(&*candidate.ext) {
match self.0.get(candidate.ext.as_bytes()) {
None => false,
Some(regexes) => {
for &(_, ref re) in regexes {
if re.is_match(&*candidate.path) {
if re.is_match(candidate.path.as_bytes()) {
return true;
}
}
@@ -723,9 +730,9 @@ impl RequiredExtensionStrategy {
if candidate.ext.is_empty() {
return;
}
if let Some(regexes) = self.0.get(&*candidate.ext) {
if let Some(regexes) = self.0.get(candidate.ext.as_bytes()) {
for &(global_index, ref re) in regexes {
if re.is_match(&*candidate.path) {
if re.is_match(candidate.path.as_bytes()) {
matches.push(global_index);
}
}
@@ -741,11 +748,11 @@ struct RegexSetStrategy {
impl RegexSetStrategy {
fn is_match(&self, candidate: &Candidate) -> bool {
self.matcher.is_match(&*candidate.path)
self.matcher.is_match(candidate.path.as_bytes())
}
fn matches_into(&self, candidate: &Candidate, matches: &mut Vec<usize>) {
for i in self.matcher.matches(&*candidate.path) {
for i in self.matcher.matches(candidate.path.as_bytes()) {
matches.push(self.map[i]);
}
}
@@ -776,18 +783,16 @@ impl MultiStrategyBuilder {
}
fn prefix(self) -> PrefixStrategy {
let it = self.literals.into_iter().map(|s| s.into_bytes());
PrefixStrategy {
matcher: AcAutomaton::new(it).into_full(),
matcher: AhoCorasick::new_auto_configured(&self.literals),
map: self.map,
longest: self.longest,
}
}
fn suffix(self) -> SuffixStrategy {
let it = self.literals.into_iter().map(|s| s.into_bytes());
SuffixStrategy {
matcher: AcAutomaton::new(it).into_full(),
matcher: AhoCorasick::new_auto_configured(&self.literals),
map: self.map,
longest: self.longest,
}

View File

@@ -1,41 +1,26 @@
use std::borrow::Cow;
use std::ffi::OsStr;
use std::path::Path;
use bstr::{ByteSlice, ByteVec};
/// The final component of the path, if it is a normal file.
///
/// If the path terminates in ., .., or consists solely of a root of prefix,
/// file_name will return None.
#[cfg(unix)]
pub fn file_name<'a, P: AsRef<Path> + ?Sized>(
path: &'a P,
) -> Option<&'a OsStr> {
use std::os::unix::ffi::OsStrExt;
use memchr::memrchr;
let path = path.as_ref().as_os_str().as_bytes();
pub fn file_name<'a>(path: &Cow<'a, [u8]>) -> Option<Cow<'a, [u8]>> {
if path.is_empty() {
return None;
} else if path.len() == 1 && path[0] == b'.' {
return None;
} else if path.last() == Some(&b'.') {
return None;
} else if path.len() >= 2 && &path[path.len() - 2..] == &b".."[..] {
} else if path.last_byte() == Some(b'.') {
return None;
}
let last_slash = memrchr(b'/', path).map(|i| i + 1).unwrap_or(0);
Some(OsStr::from_bytes(&path[last_slash..]))
}
/// The final component of the path, if it is a normal file.
///
/// If the path terminates in ., .., or consists solely of a root of prefix,
/// file_name will return None.
#[cfg(not(unix))]
pub fn file_name<'a, P: AsRef<Path> + ?Sized>(
path: &'a P,
) -> Option<&'a OsStr> {
path.as_ref().file_name()
let last_slash = path.rfind_byte(b'/').map(|i| i + 1).unwrap_or(0);
Some(match *path {
Cow::Borrowed(path) => Cow::Borrowed(&path[last_slash..]),
Cow::Owned(ref path) => {
let mut path = path.clone();
path.drain_bytes(..last_slash);
Cow::Owned(path)
}
})
}
/// Return a file extension given a path's file name.
@@ -54,55 +39,24 @@ pub fn file_name<'a, P: AsRef<Path> + ?Sized>(
/// a pattern like `*.rs` is obviously trying to match files with a `rs`
/// extension, but it also matches files like `.rs`, which doesn't have an
/// extension according to std::path::Path::extension.
pub fn file_name_ext(name: &OsStr) -> Option<Cow<[u8]>> {
pub fn file_name_ext<'a>(name: &Cow<'a, [u8]>) -> Option<Cow<'a, [u8]>> {
if name.is_empty() {
return None;
}
let name = os_str_bytes(name);
let last_dot_at = {
let result = name
.iter().enumerate().rev()
.find(|&(_, &b)| b == b'.')
.map(|(i, _)| i);
match result {
None => return None,
Some(i) => i,
}
let last_dot_at = match name.rfind_byte(b'.') {
None => return None,
Some(i) => i,
};
Some(match name {
Some(match *name {
Cow::Borrowed(name) => Cow::Borrowed(&name[last_dot_at..]),
Cow::Owned(mut name) => {
name.drain(..last_dot_at);
Cow::Owned(ref name) => {
let mut name = name.clone();
name.drain_bytes(..last_dot_at);
Cow::Owned(name)
}
})
}
/// Return raw bytes of a path, transcoded to UTF-8 if necessary.
pub fn path_bytes(path: &Path) -> Cow<[u8]> {
os_str_bytes(path.as_os_str())
}
/// Return the raw bytes of the given OS string, possibly transcoded to UTF-8.
#[cfg(unix)]
pub fn os_str_bytes(s: &OsStr) -> Cow<[u8]> {
use std::os::unix::ffi::OsStrExt;
Cow::Borrowed(s.as_bytes())
}
/// Return the raw bytes of the given OS string, possibly transcoded to UTF-8.
#[cfg(not(unix))]
pub fn os_str_bytes(s: &OsStr) -> Cow<[u8]> {
// TODO(burntsushi): On Windows, OS strings are WTF-8, which is a superset
// of UTF-8, so even if we could get at the raw bytes, they wouldn't
// be useful. We *must* convert to UTF-8 before doing path matching.
// Unfortunate, but necessary.
match s.to_string_lossy() {
Cow::Owned(s) => Cow::Owned(s.into_bytes()),
Cow::Borrowed(s) => Cow::Borrowed(s.as_bytes()),
}
}
/// Normalizes a path to use `/` as a separator everywhere, even on platforms
/// that recognize other characters as separators.
#[cfg(unix)]
@@ -129,7 +83,8 @@ pub fn normalize_path(mut path: Cow<[u8]>) -> Cow<[u8]> {
#[cfg(test)]
mod tests {
use std::borrow::Cow;
use std::ffi::OsStr;
use bstr::{B, ByteVec};
use super::{file_name_ext, normalize_path};
@@ -137,8 +92,9 @@ mod tests {
($name:ident, $file_name:expr, $ext:expr) => {
#[test]
fn $name() {
let got = file_name_ext(OsStr::new($file_name));
assert_eq!($ext.map(|s| Cow::Borrowed(s.as_bytes())), got);
let bs = Vec::from($file_name);
let got = file_name_ext(&Cow::Owned(bs));
assert_eq!($ext.map(|s| Cow::Borrowed(B(s))), got);
}
};
}
@@ -153,7 +109,8 @@ mod tests {
($name:ident, $path:expr, $expected:expr) => {
#[test]
fn $name() {
let got = normalize_path(Cow::Owned($path.to_vec()));
let bs = Vec::from_slice($path);
let got = normalize_path(Cow::Owned(bs));
assert_eq!($expected.to_vec(), got.into_owned());
}
};

26
grep-cli/Cargo.toml Normal file
View File

@@ -0,0 +1,26 @@
[package]
name = "grep-cli"
version = "0.1.3" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
Utilities for search oriented command line applications.
"""
documentation = "https://docs.rs/grep-cli"
homepage = "https://github.com/BurntSushi/ripgrep"
repository = "https://github.com/BurntSushi/ripgrep"
readme = "README.md"
keywords = ["regex", "grep", "cli", "utility", "util"]
license = "Unlicense/MIT"
[dependencies]
atty = "0.2.11"
bstr = "0.2.0"
globset = { version = "0.4.3", path = "../globset" }
lazy_static = "1.1.0"
log = "0.4.5"
regex = "1.1"
same-file = "1.0.4"
termcolor = "1.0.4"
[target.'cfg(windows)'.dependencies.winapi-util]
version = "0.1.1"

21
grep-cli/LICENSE-MIT Normal file
View File

@@ -0,0 +1,21 @@
The MIT License (MIT)
Copyright (c) 2015 Andrew Gallant
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

38
grep-cli/README.md Normal file
View File

@@ -0,0 +1,38 @@
grep-cli
--------
A utility library that provides common routines desired in search oriented
command line applications. This includes, but is not limited to, parsing hex
escapes, detecting whether stdin is readable and more. To the extent possible,
this crate strives for compatibility across Windows, macOS and Linux.
[![Linux build status](https://api.travis-ci.org/BurntSushi/ripgrep.svg)](https://travis-ci.org/BurntSushi/ripgrep)
[![Windows build status](https://ci.appveyor.com/api/projects/status/github/BurntSushi/ripgrep?svg=true)](https://ci.appveyor.com/project/BurntSushi/ripgrep)
[![](https://img.shields.io/crates/v/grep-cli.svg)](https://crates.io/crates/grep-cli)
Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org).
### Documentation
[https://docs.rs/grep-cli](https://docs.rs/grep-cli)
**NOTE:** You probably don't want to use this crate directly. Instead, you
should prefer the facade defined in the
[`grep`](https://docs.rs/grep)
crate.
### Usage
Add this to your `Cargo.toml`:
```toml
[dependencies]
grep-cli = "0.1"
```
and this to your crate root:
```rust
extern crate grep_cli;
```

24
grep-cli/UNLICENSE Normal file
View File

@@ -0,0 +1,24 @@
This is free and unencumbered software released into the public domain.
Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.
In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
For more information, please refer to <http://unlicense.org/>

382
grep-cli/src/decompress.rs Normal file
View File

@@ -0,0 +1,382 @@
use std::ffi::{OsStr, OsString};
use std::fs::File;
use std::io;
use std::path::Path;
use std::process::Command;
use globset::{Glob, GlobSet, GlobSetBuilder};
use process::{CommandError, CommandReader, CommandReaderBuilder};
/// A builder for a matcher that determines which files get decompressed.
#[derive(Clone, Debug)]
pub struct DecompressionMatcherBuilder {
/// The commands for each matching glob.
commands: Vec<DecompressionCommand>,
/// Whether to include the default matching rules.
defaults: bool,
}
/// A representation of a single command for decompressing data
/// out-of-proccess.
#[derive(Clone, Debug)]
struct DecompressionCommand {
/// The glob that matches this command.
glob: String,
/// The command or binary name.
bin: OsString,
/// The arguments to invoke with the command.
args: Vec<OsString>,
}
impl Default for DecompressionMatcherBuilder {
fn default() -> DecompressionMatcherBuilder {
DecompressionMatcherBuilder::new()
}
}
impl DecompressionMatcherBuilder {
/// Create a new builder for configuring a decompression matcher.
pub fn new() -> DecompressionMatcherBuilder {
DecompressionMatcherBuilder {
commands: vec![],
defaults: true,
}
}
/// Build a matcher for determining how to decompress files.
///
/// If there was a problem compiling the matcher, then an error is
/// returned.
pub fn build(&self) -> Result<DecompressionMatcher, CommandError> {
let defaults =
if !self.defaults {
vec![]
} else {
default_decompression_commands()
};
let mut glob_builder = GlobSetBuilder::new();
let mut commands = vec![];
for decomp_cmd in defaults.iter().chain(&self.commands) {
let glob = Glob::new(&decomp_cmd.glob).map_err(|err| {
CommandError::io(io::Error::new(io::ErrorKind::Other, err))
})?;
glob_builder.add(glob);
commands.push(decomp_cmd.clone());
}
let globs = glob_builder.build().map_err(|err| {
CommandError::io(io::Error::new(io::ErrorKind::Other, err))
})?;
Ok(DecompressionMatcher { globs, commands })
}
/// When enabled, the default matching rules will be compiled into this
/// matcher before any other associations. When disabled, only the
/// rules explicitly given to this builder will be used.
///
/// This is enabled by default.
pub fn defaults(&mut self, yes: bool) -> &mut DecompressionMatcherBuilder {
self.defaults = yes;
self
}
/// Associates a glob with a command to decompress files matching the glob.
///
/// If multiple globs match the same file, then the most recently added
/// glob takes precedence.
///
/// The syntax for the glob is documented in the
/// [`globset` crate](https://docs.rs/globset/#syntax).
pub fn associate<P, I, A>(
&mut self,
glob: &str,
program: P,
args: I,
) -> &mut DecompressionMatcherBuilder
where P: AsRef<OsStr>,
I: IntoIterator<Item=A>,
A: AsRef<OsStr>,
{
let glob = glob.to_string();
let bin = program.as_ref().to_os_string();
let args = args
.into_iter()
.map(|a| a.as_ref().to_os_string())
.collect();
self.commands.push(DecompressionCommand { glob, bin, args });
self
}
}
/// A matcher for determining how to decompress files.
#[derive(Clone, Debug)]
pub struct DecompressionMatcher {
/// The set of globs to match. Each glob has a corresponding entry in
/// `commands`. When a glob matches, the corresponding command should be
/// used to perform out-of-process decompression.
globs: GlobSet,
/// The commands for each matching glob.
commands: Vec<DecompressionCommand>,
}
impl Default for DecompressionMatcher {
fn default() -> DecompressionMatcher {
DecompressionMatcher::new()
}
}
impl DecompressionMatcher {
/// Create a new matcher with default rules.
///
/// To add more matching rules, build a matcher with
/// [`DecompressionMatcherBuilder`](struct.DecompressionMatcherBuilder.html).
pub fn new() -> DecompressionMatcher {
DecompressionMatcherBuilder::new()
.build()
.expect("built-in matching rules should always compile")
}
/// Return a pre-built command based on the given file path that can
/// decompress its contents. If no such decompressor is known, then this
/// returns `None`.
///
/// If there are multiple possible commands matching the given path, then
/// the command added last takes precedence.
pub fn command<P: AsRef<Path>>(&self, path: P) -> Option<Command> {
for i in self.globs.matches(path).into_iter().rev() {
let decomp_cmd = &self.commands[i];
let mut cmd = Command::new(&decomp_cmd.bin);
cmd.args(&decomp_cmd.args);
return Some(cmd);
}
None
}
/// Returns true if and only if the given file path has at least one
/// matching command to perform decompression on.
pub fn has_command<P: AsRef<Path>>(&self, path: P) -> bool {
self.globs.is_match(path)
}
}
/// Configures and builds a streaming reader for decompressing data.
#[derive(Clone, Debug, Default)]
pub struct DecompressionReaderBuilder {
matcher: DecompressionMatcher,
command_builder: CommandReaderBuilder,
}
impl DecompressionReaderBuilder {
/// Create a new builder with the default configuration.
pub fn new() -> DecompressionReaderBuilder {
DecompressionReaderBuilder::default()
}
/// Build a new streaming reader for decompressing data.
///
/// If decompression is done out-of-process and if there was a problem
/// spawning the process, then its error is logged at the debug level and a
/// passthru reader is returned that does no decompression. This behavior
/// typically occurs when the given file path matches a decompression
/// command, but is executing in an environment where the decompression
/// command is not available.
///
/// If the given file path could not be matched with a decompression
/// strategy, then a passthru reader is returned that does no
/// decompression.
pub fn build<P: AsRef<Path>>(
&self,
path: P,
) -> Result<DecompressionReader, CommandError> {
let path = path.as_ref();
let mut cmd = match self.matcher.command(path) {
None => return DecompressionReader::new_passthru(path),
Some(cmd) => cmd,
};
cmd.arg(path);
match self.command_builder.build(&mut cmd) {
Ok(cmd_reader) => Ok(DecompressionReader { rdr: Ok(cmd_reader) }),
Err(err) => {
debug!(
"{}: error spawning command '{:?}': {} \
(falling back to uncompressed reader)",
path.display(),
cmd,
err,
);
DecompressionReader::new_passthru(path)
}
}
}
/// Set the matcher to use to look up the decompression command for each
/// file path.
///
/// A set of sensible rules is enabled by default. Setting this will
/// completely replace the current rules.
pub fn matcher(
&mut self,
matcher: DecompressionMatcher,
) -> &mut DecompressionReaderBuilder {
self.matcher = matcher;
self
}
/// Get the underlying matcher currently used by this builder.
pub fn get_matcher(&self) -> &DecompressionMatcher {
&self.matcher
}
/// When enabled, the reader will asynchronously read the contents of the
/// command's stderr output. When disabled, stderr is only read after the
/// stdout stream has been exhausted (or if the process quits with an error
/// code).
///
/// Note that when enabled, this may require launching an additional
/// thread in order to read stderr. This is done so that the process being
/// executed is never blocked from writing to stdout or stderr. If this is
/// disabled, then it is possible for the process to fill up the stderr
/// buffer and deadlock.
///
/// This is enabled by default.
pub fn async_stderr(
&mut self,
yes: bool,
) -> &mut DecompressionReaderBuilder {
self.command_builder.async_stderr(yes);
self
}
}
/// A streaming reader for decompressing the contents of a file.
///
/// The purpose of this reader is to provide a seamless way to decompress the
/// contents of file using existing tools in the current environment. This is
/// meant to be an alternative to using decompression libraries in favor of the
/// simplicity and portability of using external commands such as `gzip` and
/// `xz`. This does impose the overhead of spawning a process, so other means
/// for performing decompression should be sought if this overhead isn't
/// acceptable.
///
/// A decompression reader comes with a default set of matching rules that are
/// meant to associate file paths with the corresponding command to use to
/// decompress them. For example, a glob like `*.gz` matches gzip compressed
/// files with the command `gzip -d -c`. If a file path does not match any
/// existing rules, or if it matches a rule whose command does not exist in the
/// current environment, then the decompression reader passes through the
/// contents of the underlying file without doing any decompression.
///
/// The default matching rules are probably good enough for most cases, and if
/// they require revision, pull requests are welcome. In cases where they must
/// be changed or extended, they can be customized through the use of
/// [`DecompressionMatcherBuilder`](struct.DecompressionMatcherBuilder.html)
/// and
/// [`DecompressionReaderBuilder`](struct.DecompressionReaderBuilder.html).
///
/// By default, this reader will asynchronously read the processes' stderr.
/// This prevents subtle deadlocking bugs for noisy processes that write a lot
/// to stderr. Currently, the entire contents of stderr is read on to the heap.
///
/// # Example
///
/// This example shows how to read the decompressed contents of a file without
/// needing to explicitly choose the decompression command to run.
///
/// Note that if you need to decompress multiple files, it is better to use
/// `DecompressionReaderBuilder`, which will amortize the cost of compiling the
/// matcher.
///
/// ```no_run
/// use std::io::Read;
/// use std::process::Command;
/// use grep_cli::DecompressionReader;
///
/// # fn example() -> Result<(), Box<::std::error::Error>> {
/// let mut rdr = DecompressionReader::new("/usr/share/man/man1/ls.1.gz")?;
/// let mut contents = vec![];
/// rdr.read_to_end(&mut contents)?;
/// # Ok(()) }
/// ```
#[derive(Debug)]
pub struct DecompressionReader {
rdr: Result<CommandReader, File>,
}
impl DecompressionReader {
/// Build a new streaming reader for decompressing data.
///
/// If decompression is done out-of-process and if there was a problem
/// spawning the process, then its error is returned.
///
/// If the given file path could not be matched with a decompression
/// strategy, then a passthru reader is returned that does no
/// decompression.
///
/// This uses the default matching rules for determining how to decompress
/// the given file. To change those matching rules, use
/// [`DecompressionReaderBuilder`](struct.DecompressionReaderBuilder.html)
/// and
/// [`DecompressionMatcherBuilder`](struct.DecompressionMatcherBuilder.html).
///
/// When creating readers for many paths. it is better to use the builder
/// since it will amortize the cost of constructing the matcher.
pub fn new<P: AsRef<Path>>(
path: P,
) -> Result<DecompressionReader, CommandError> {
DecompressionReaderBuilder::new().build(path)
}
/// Creates a new "passthru" decompression reader that reads from the file
/// corresponding to the given path without doing decompression and without
/// executing another process.
fn new_passthru(path: &Path) -> Result<DecompressionReader, CommandError> {
let file = File::open(path)?;
Ok(DecompressionReader { rdr: Err(file) })
}
}
impl io::Read for DecompressionReader {
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
match self.rdr {
Ok(ref mut rdr) => rdr.read(buf),
Err(ref mut rdr) => rdr.read(buf),
}
}
}
fn default_decompression_commands() -> Vec<DecompressionCommand> {
const ARGS_GZIP: &[&str] = &["gzip", "-d", "-c"];
const ARGS_BZIP: &[&str] = &["bzip2", "-d", "-c"];
const ARGS_XZ: &[&str] = &["xz", "-d", "-c"];
const ARGS_LZ4: &[&str] = &["lz4", "-d", "-c"];
const ARGS_LZMA: &[&str] = &["xz", "--format=lzma", "-d", "-c"];
const ARGS_BROTLI: &[&str] = &["brotli", "-d", "-c"];
const ARGS_ZSTD: &[&str] = &["zstd", "-q", "-d", "-c"];
fn cmd(glob: &str, args: &[&str]) -> DecompressionCommand {
DecompressionCommand {
glob: glob.to_string(),
bin: OsStr::new(&args[0]).to_os_string(),
args: args
.iter()
.skip(1)
.map(|s| OsStr::new(s).to_os_string())
.collect(),
}
}
vec![
cmd("*.gz", ARGS_GZIP),
cmd("*.tgz", ARGS_GZIP),
cmd("*.bz2", ARGS_BZIP),
cmd("*.tbz2", ARGS_BZIP),
cmd("*.xz", ARGS_XZ),
cmd("*.txz", ARGS_XZ),
cmd("*.lz4", ARGS_LZ4),
cmd("*.lzma", ARGS_LZMA),
cmd("*.br", ARGS_BROTLI),
cmd("*.zst", ARGS_ZSTD),
cmd("*.zstd", ARGS_ZSTD),
]
}

262
grep-cli/src/escape.rs Normal file
View File

@@ -0,0 +1,262 @@
use std::ffi::OsStr;
use std::str;
use bstr::{ByteSlice, ByteVec};
/// A single state in the state machine used by `unescape`.
#[derive(Clone, Copy, Eq, PartialEq)]
enum State {
/// The state after seeing a `\`.
Escape,
/// The state after seeing a `\x`.
HexFirst,
/// The state after seeing a `\x[0-9A-Fa-f]`.
HexSecond(char),
/// Default state.
Literal,
}
/// Escapes arbitrary bytes into a human readable string.
///
/// This converts `\t`, `\r` and `\n` into their escaped forms. It also
/// converts the non-printable subset of ASCII in addition to invalid UTF-8
/// bytes to hexadecimal escape sequences. Everything else is left as is.
///
/// The dual of this routine is [`unescape`](fn.unescape.html).
///
/// # Example
///
/// This example shows how to convert a byte string that contains a `\n` and
/// invalid UTF-8 bytes into a `String`.
///
/// Pay special attention to the use of raw strings. That is, `r"\n"` is
/// equivalent to `"\\n"`.
///
/// ```
/// use grep_cli::escape;
///
/// assert_eq!(r"foo\nbar\xFFbaz", escape(b"foo\nbar\xFFbaz"));
/// ```
pub fn escape(bytes: &[u8]) -> String {
let mut escaped = String::new();
for (s, e, ch) in bytes.char_indices() {
if ch == '\u{FFFD}' {
for b in bytes[s..e].bytes() {
escape_byte(b, &mut escaped);
}
} else {
escape_char(ch, &mut escaped);
}
}
escaped
}
/// Escapes an OS string into a human readable string.
///
/// This is like [`escape`](fn.escape.html), but accepts an OS string.
pub fn escape_os(string: &OsStr) -> String {
escape(Vec::from_os_str_lossy(string).as_bytes())
}
/// Unescapes a string.
///
/// It supports a limited set of escape sequences:
///
/// * `\t`, `\r` and `\n` are mapped to their corresponding ASCII bytes.
/// * `\xZZ` hexadecimal escapes are mapped to their byte.
///
/// Everything else is left as is, including non-hexadecimal escapes like
/// `\xGG`.
///
/// This is useful when it is desirable for a command line argument to be
/// capable of specifying arbitrary bytes or otherwise make it easier to
/// specify non-printable characters.
///
/// The dual of this routine is [`escape`](fn.escape.html).
///
/// # Example
///
/// This example shows how to convert an escaped string (which is valid UTF-8)
/// into a corresponding sequence of bytes. Each escape sequence is mapped to
/// its bytes, which may include invalid UTF-8.
///
/// Pay special attention to the use of raw strings. That is, `r"\n"` is
/// equivalent to `"\\n"`.
///
/// ```
/// use grep_cli::unescape;
///
/// assert_eq!(&b"foo\nbar\xFFbaz"[..], &*unescape(r"foo\nbar\xFFbaz"));
/// ```
pub fn unescape(s: &str) -> Vec<u8> {
use self::State::*;
let mut bytes = vec![];
let mut state = Literal;
for c in s.chars() {
match state {
Escape => {
match c {
'\\' => { bytes.push(b'\\'); state = Literal; }
'n' => { bytes.push(b'\n'); state = Literal; }
'r' => { bytes.push(b'\r'); state = Literal; }
't' => { bytes.push(b'\t'); state = Literal; }
'x' => { state = HexFirst; }
c => {
bytes.extend(format!(r"\{}", c).into_bytes());
state = Literal;
}
}
}
HexFirst => {
match c {
'0'..='9' | 'A'..='F' | 'a'..='f' => {
state = HexSecond(c);
}
c => {
bytes.extend(format!(r"\x{}", c).into_bytes());
state = Literal;
}
}
}
HexSecond(first) => {
match c {
'0'..='9' | 'A'..='F' | 'a'..='f' => {
let ordinal = format!("{}{}", first, c);
let byte = u8::from_str_radix(&ordinal, 16).unwrap();
bytes.push(byte);
state = Literal;
}
c => {
let original = format!(r"\x{}{}", first, c);
bytes.extend(original.into_bytes());
state = Literal;
}
}
}
Literal => {
match c {
'\\' => { state = Escape; }
c => { bytes.extend(c.to_string().as_bytes()); }
}
}
}
}
match state {
Escape => bytes.push(b'\\'),
HexFirst => bytes.extend(b"\\x"),
HexSecond(c) => bytes.extend(format!("\\x{}", c).into_bytes()),
Literal => {}
}
bytes
}
/// Unescapes an OS string.
///
/// This is like [`unescape`](fn.unescape.html), but accepts an OS string.
///
/// Note that this first lossily decodes the given OS string as UTF-8. That
/// is, an escaped string (the thing given) should be valid UTF-8.
pub fn unescape_os(string: &OsStr) -> Vec<u8> {
unescape(&string.to_string_lossy())
}
/// Adds the given codepoint to the given string, escaping it if necessary.
fn escape_char(cp: char, into: &mut String) {
if cp.is_ascii() {
escape_byte(cp as u8, into);
} else {
into.push(cp);
}
}
/// Adds the given byte to the given string, escaping it if necessary.
fn escape_byte(byte: u8, into: &mut String) {
match byte {
0x21..=0x5B | 0x5D..=0x7D => into.push(byte as char),
b'\n' => into.push_str(r"\n"),
b'\r' => into.push_str(r"\r"),
b'\t' => into.push_str(r"\t"),
b'\\' => into.push_str(r"\\"),
_ => into.push_str(&format!(r"\x{:02X}", byte)),
}
}
#[cfg(test)]
mod tests {
use super::{escape, unescape};
fn b(bytes: &'static [u8]) -> Vec<u8> {
bytes.to_vec()
}
#[test]
fn empty() {
assert_eq!(b(b""), unescape(r""));
assert_eq!(r"", escape(b""));
}
#[test]
fn backslash() {
assert_eq!(b(b"\\"), unescape(r"\\"));
assert_eq!(r"\\", escape(b"\\"));
}
#[test]
fn nul() {
assert_eq!(b(b"\x00"), unescape(r"\x00"));
assert_eq!(r"\x00", escape(b"\x00"));
}
#[test]
fn nl() {
assert_eq!(b(b"\n"), unescape(r"\n"));
assert_eq!(r"\n", escape(b"\n"));
}
#[test]
fn tab() {
assert_eq!(b(b"\t"), unescape(r"\t"));
assert_eq!(r"\t", escape(b"\t"));
}
#[test]
fn carriage() {
assert_eq!(b(b"\r"), unescape(r"\r"));
assert_eq!(r"\r", escape(b"\r"));
}
#[test]
fn nothing_simple() {
assert_eq!(b(b"\\a"), unescape(r"\a"));
assert_eq!(b(b"\\a"), unescape(r"\\a"));
assert_eq!(r"\\a", escape(b"\\a"));
}
#[test]
fn nothing_hex0() {
assert_eq!(b(b"\\x"), unescape(r"\x"));
assert_eq!(b(b"\\x"), unescape(r"\\x"));
assert_eq!(r"\\x", escape(b"\\x"));
}
#[test]
fn nothing_hex1() {
assert_eq!(b(b"\\xz"), unescape(r"\xz"));
assert_eq!(b(b"\\xz"), unescape(r"\\xz"));
assert_eq!(r"\\xz", escape(b"\\xz"));
}
#[test]
fn nothing_hex2() {
assert_eq!(b(b"\\xzz"), unescape(r"\xzz"));
assert_eq!(b(b"\\xzz"), unescape(r"\\xzz"));
assert_eq!(r"\\xzz", escape(b"\\xzz"));
}
#[test]
fn invalid_utf8() {
assert_eq!(r"\xFF", escape(b"\xFF"));
assert_eq!(r"a\xFFb", escape(b"a\xFFb"));
}
}

171
grep-cli/src/human.rs Normal file
View File

@@ -0,0 +1,171 @@
use std::error;
use std::fmt;
use std::io;
use std::num::ParseIntError;
use regex::Regex;
/// An error that occurs when parsing a human readable size description.
///
/// This error provides a end user friendly message describing why the
/// description coudln't be parsed and what the expected format is.
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct ParseSizeError {
original: String,
kind: ParseSizeErrorKind,
}
#[derive(Clone, Debug, Eq, PartialEq)]
enum ParseSizeErrorKind {
InvalidFormat,
InvalidInt(ParseIntError),
Overflow,
}
impl ParseSizeError {
fn format(original: &str) -> ParseSizeError {
ParseSizeError {
original: original.to_string(),
kind: ParseSizeErrorKind::InvalidFormat,
}
}
fn int(original: &str, err: ParseIntError) -> ParseSizeError {
ParseSizeError {
original: original.to_string(),
kind: ParseSizeErrorKind::InvalidInt(err),
}
}
fn overflow(original: &str) -> ParseSizeError {
ParseSizeError {
original: original.to_string(),
kind: ParseSizeErrorKind::Overflow,
}
}
}
impl error::Error for ParseSizeError {
fn description(&self) -> &str { "invalid size" }
}
impl fmt::Display for ParseSizeError {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
use self::ParseSizeErrorKind::*;
match self.kind {
InvalidFormat => {
write!(
f,
"invalid format for size '{}', which should be a sequence \
of digits followed by an optional 'K', 'M' or 'G' \
suffix",
self.original
)
}
InvalidInt(ref err) => {
write!(
f,
"invalid integer found in size '{}': {}",
self.original,
err
)
}
Overflow => {
write!(f, "size too big in '{}'", self.original)
}
}
}
}
impl From<ParseSizeError> for io::Error {
fn from(size_err: ParseSizeError) -> io::Error {
io::Error::new(io::ErrorKind::Other, size_err)
}
}
/// Parse a human readable size like `2M` into a corresponding number of bytes.
///
/// Supported size suffixes are `K` (for kilobyte), `M` (for megabyte) and `G`
/// (for gigabyte). If a size suffix is missing, then the size is interpreted
/// as bytes. If the size is too big to fit into a `u64`, then this returns an
/// error.
///
/// Additional suffixes may be added over time.
pub fn parse_human_readable_size(size: &str) -> Result<u64, ParseSizeError> {
lazy_static! {
// Normally I'd just parse something this simple by hand to avoid the
// regex dep, but we bring regex in any way for glob matching, so might
// as well use it.
static ref RE: Regex = Regex::new(r"^([0-9]+)([KMG])?$").unwrap();
}
let caps = match RE.captures(size) {
Some(caps) => caps,
None => return Err(ParseSizeError::format(size)),
};
let value: u64 = caps[1].parse().map_err(|err| {
ParseSizeError::int(size, err)
})?;
let suffix = match caps.get(2) {
None => return Ok(value),
Some(cap) => cap.as_str(),
};
let bytes = match suffix {
"K" => value.checked_mul(1<<10),
"M" => value.checked_mul(1<<20),
"G" => value.checked_mul(1<<30),
// Because if the regex matches this group, it must be [KMG].
_ => unreachable!(),
};
bytes.ok_or_else(|| ParseSizeError::overflow(size))
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn suffix_none() {
let x = parse_human_readable_size("123").unwrap();
assert_eq!(123, x);
}
#[test]
fn suffix_k() {
let x = parse_human_readable_size("123K").unwrap();
assert_eq!(123 * (1<<10), x);
}
#[test]
fn suffix_m() {
let x = parse_human_readable_size("123M").unwrap();
assert_eq!(123 * (1<<20), x);
}
#[test]
fn suffix_g() {
let x = parse_human_readable_size("123G").unwrap();
assert_eq!(123 * (1<<30), x);
}
#[test]
fn invalid_empty() {
assert!(parse_human_readable_size("").is_err());
}
#[test]
fn invalid_non_digit() {
assert!(parse_human_readable_size("a").is_err());
}
#[test]
fn invalid_overflow() {
assert!(parse_human_readable_size("9999999999999999G").is_err());
}
#[test]
fn invalid_suffix() {
assert!(parse_human_readable_size("123T").is_err());
}
}

252
grep-cli/src/lib.rs Normal file
View File

@@ -0,0 +1,252 @@
/*!
This crate provides common routines used in command line applications, with a
focus on routines useful for search oriented applications. As a utility
library, there is no central type or function. However, a key focus of this
crate is to improve failure modes and provide user friendly error messages
when things go wrong.
To the best extent possible, everything in this crate works on Windows, macOS
and Linux.
# Standard I/O
The
[`is_readable_stdin`](fn.is_readable_stdin.html),
[`is_tty_stderr`](fn.is_tty_stderr.html),
[`is_tty_stdin`](fn.is_tty_stdin.html)
and
[`is_tty_stdout`](fn.is_tty_stdout.html)
routines query aspects of standard I/O. `is_readable_stdin` determines whether
stdin can be usefully read from, while the `tty` methods determine whether a
tty is attached to stdin/stdout/stderr.
`is_readable_stdin` is useful when writing an application that changes behavior
based on whether the application was invoked with data on stdin. For example,
`rg foo` might recursively search the current working directory for
occurrences of `foo`, but `rg foo < file` might only search the contents of
`file`.
The `tty` methods are useful for similar reasons. Namely, commands like `ls`
will change their output depending on whether they are printing to a terminal
or not. For example, `ls` shows a file on each line when stdout is redirected
to a file or a pipe, but condenses the output to show possibly many files on
each line when stdout is connected to a tty.
# Coloring and buffering
The
[`stdout`](fn.stdout.html),
[`stdout_buffered_block`](fn.stdout_buffered_block.html)
and
[`stdout_buffered_line`](fn.stdout_buffered_line.html)
routines are alternative constructors for
[`StandardStream`](struct.StandardStream.html).
A `StandardStream` implements `termcolor::WriteColor`, which provides a way
to emit colors to terminals. Its key use is the encapsulation of buffering
style. Namely, `stdout` will return a line buffered `StandardStream` if and
only if stdout is connected to a tty, and will otherwise return a block
buffered `StandardStream`. Line buffering is important for use with a tty
because it typically decreases the latency at which the end user sees output.
Block buffering is used otherwise because it is faster, and redirecting stdout
to a file typically doesn't benefit from the decreased latency that line
buffering provides.
The `stdout_buffered_block` and `stdout_buffered_line` can be used to
explicitly set the buffering strategy regardless of whether stdout is connected
to a tty or not.
# Escaping
The
[`escape`](fn.escape.html),
[`escape_os`](fn.escape_os.html),
[`unescape`](fn.unescape.html)
and
[`unescape_os`](fn.unescape_os.html)
routines provide a user friendly way of dealing with UTF-8 encoded strings that
can express arbitrary bytes. For example, you might want to accept a string
containing arbitrary bytes as a command line argument, but most interactive
shells make such strings difficult to type. Instead, we can ask users to use
escape sequences.
For example, `a\xFFz` is itself a valid UTF-8 string corresponding to the
following bytes:
```ignore
[b'a', b'\\', b'x', b'F', b'F', b'z']
```
However, we can
interpret `\xFF` as an escape sequence with the `unescape`/`unescape_os`
routines, which will yield
```ignore
[b'a', b'\xFF', b'z']
```
instead. For example:
```
use grep_cli::unescape;
// Note the use of a raw string!
assert_eq!(vec![b'a', b'\xFF', b'z'], unescape(r"a\xFFz"));
```
The `escape`/`escape_os` routines provide the reverse transformation, which
makes it easy to show user friendly error messages involving arbitrary bytes.
# Building patterns
Typically, regular expression patterns must be valid UTF-8. However, command
line arguments aren't guaranteed to be valid UTF-8. Unfortunately, the
standard library's UTF-8 conversion functions from `OsStr`s do not provide
good error messages. However, the
[`pattern_from_bytes`](fn.pattern_from_bytes.html)
and
[`pattern_from_os`](fn.pattern_from_os.html)
do, including reporting exactly where the first invalid UTF-8 byte is seen.
Additionally, it can be useful to read patterns from a file while reporting
good error messages that include line numbers. The
[`patterns_from_path`](fn.patterns_from_path.html),
[`patterns_from_reader`](fn.patterns_from_reader.html)
and
[`patterns_from_stdin`](fn.patterns_from_stdin.html)
routines do just that. If any pattern is found that is invalid UTF-8, then the
error includes the file path (if available) along with the line number and the
byte offset at which the first invalid UTF-8 byte was observed.
# Read process output
Sometimes a command line application needs to execute other processes and read
its stdout in a streaming fashion. The
[`CommandReader`](struct.CommandReader.html)
provides this functionality with an explicit goal of improving failure modes.
In particular, if the process exits with an error code, then stderr is read
and converted into a normal Rust error to show to end users. This makes the
underlying failure modes explicit and gives more information to end users for
debugging the problem.
As a special case,
[`DecompressionReader`](struct.DecompressionReader.html)
provides a way to decompress arbitrary files by matching their file extensions
up with corresponding decompression programs (such as `gzip` and `xz`). This
is useful as a means of performing simplistic decompression in a portable
manner without binding to specific compression libraries. This does come with
some overhead though, so if you need to decompress lots of small files, this
may not be an appropriate convenience to use.
Each reader has a corresponding builder for additional configuration, such as
whether to read stderr asynchronously in order to avoid deadlock (which is
enabled by default).
# Miscellaneous parsing
The
[`parse_human_readable_size`](fn.parse_human_readable_size.html)
routine parses strings like `2M` and converts them to the corresponding number
of bytes (`2 * 1<<20` in this case). If an invalid size is found, then a good
error message is crafted that typically tells the user how to fix the problem.
*/
#![deny(missing_docs)]
extern crate atty;
extern crate bstr;
extern crate globset;
#[macro_use]
extern crate lazy_static;
#[macro_use]
extern crate log;
extern crate regex;
extern crate same_file;
extern crate termcolor;
#[cfg(windows)]
extern crate winapi_util;
mod decompress;
mod escape;
mod human;
mod pattern;
mod process;
mod wtr;
pub use decompress::{
DecompressionMatcher, DecompressionMatcherBuilder,
DecompressionReader, DecompressionReaderBuilder,
};
pub use escape::{escape, escape_os, unescape, unescape_os};
pub use human::{ParseSizeError, parse_human_readable_size};
pub use pattern::{
InvalidPatternError,
pattern_from_os, pattern_from_bytes,
patterns_from_path, patterns_from_reader, patterns_from_stdin,
};
pub use process::{CommandError, CommandReader, CommandReaderBuilder};
pub use wtr::{
StandardStream,
stdout, stdout_buffered_line, stdout_buffered_block,
};
/// Returns true if and only if stdin is believed to be readable.
///
/// When stdin is readable, command line programs may choose to behave
/// differently than when stdin is not readable. For example, `command foo`
/// might search the current directory for occurrences of `foo` where as
/// `command foo < some-file` or `cat some-file | command foo` might instead
/// only search stdin for occurrences of `foo`.
pub fn is_readable_stdin() -> bool {
#[cfg(unix)]
fn imp() -> bool {
use std::os::unix::fs::FileTypeExt;
use same_file::Handle;
let ft = match Handle::stdin().and_then(|h| h.as_file().metadata()) {
Err(_) => return false,
Ok(md) => md.file_type(),
};
ft.is_file() || ft.is_fifo()
}
#[cfg(windows)]
fn imp() -> bool {
use winapi_util as winutil;
winutil::file::typ(winutil::HandleRef::stdin())
.map(|t| t.is_disk() || t.is_pipe())
.unwrap_or(false)
}
!is_tty_stdin() && imp()
}
/// Returns true if and only if stdin is believed to be connectted to a tty
/// or a console.
pub fn is_tty_stdin() -> bool {
atty::is(atty::Stream::Stdin)
}
/// Returns true if and only if stdout is believed to be connectted to a tty
/// or a console.
///
/// This is useful for when you want your command line program to produce
/// different output depending on whether it's printing directly to a user's
/// terminal or whether it's being redirected somewhere else. For example,
/// implementations of `ls` will often show one item per line when stdout is
/// redirected, but will condensed output when printing to a tty.
pub fn is_tty_stdout() -> bool {
atty::is(atty::Stream::Stdout)
}
/// Returns true if and only if stderr is believed to be connectted to a tty
/// or a console.
pub fn is_tty_stderr() -> bool {
atty::is(atty::Stream::Stderr)
}

201
grep-cli/src/pattern.rs Normal file
View File

@@ -0,0 +1,201 @@
use std::error;
use std::ffi::OsStr;
use std::fmt;
use std::fs::File;
use std::io;
use std::path::Path;
use std::str;
use bstr::io::BufReadExt;
use escape::{escape, escape_os};
/// An error that occurs when a pattern could not be converted to valid UTF-8.
///
/// The purpose of this error is to give a more targeted failure mode for
/// patterns written by end users that are not valid UTF-8.
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct InvalidPatternError {
original: String,
valid_up_to: usize,
}
impl InvalidPatternError {
/// Returns the index in the given string up to which valid UTF-8 was
/// verified.
pub fn valid_up_to(&self) -> usize {
self.valid_up_to
}
}
impl error::Error for InvalidPatternError {
fn description(&self) -> &str { "invalid pattern" }
}
impl fmt::Display for InvalidPatternError {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(
f,
"found invalid UTF-8 in pattern at byte offset {} \
(use hex escape sequences to match arbitrary bytes \
in a pattern, e.g., \\xFF): '{}'",
self.valid_up_to,
self.original,
)
}
}
impl From<InvalidPatternError> for io::Error {
fn from(paterr: InvalidPatternError) -> io::Error {
io::Error::new(io::ErrorKind::Other, paterr)
}
}
/// Convert an OS string into a regular expression pattern.
///
/// This conversion fails if the given pattern is not valid UTF-8, in which
/// case, a targeted error with more information about where the invalid UTF-8
/// occurs is given. The error also suggests the use of hex escape sequences,
/// which are supported by many regex engines.
pub fn pattern_from_os(pattern: &OsStr) -> Result<&str, InvalidPatternError> {
pattern.to_str().ok_or_else(|| {
let valid_up_to = pattern
.to_string_lossy()
.find('\u{FFFD}')
.expect("a Unicode replacement codepoint for invalid UTF-8");
InvalidPatternError {
original: escape_os(pattern),
valid_up_to: valid_up_to,
}
})
}
/// Convert arbitrary bytes into a regular expression pattern.
///
/// This conversion fails if the given pattern is not valid UTF-8, in which
/// case, a targeted error with more information about where the invalid UTF-8
/// occurs is given. The error also suggests the use of hex escape sequences,
/// which are supported by many regex engines.
pub fn pattern_from_bytes(
pattern: &[u8],
) -> Result<&str, InvalidPatternError> {
str::from_utf8(pattern).map_err(|err| {
InvalidPatternError {
original: escape(pattern),
valid_up_to: err.valid_up_to(),
}
})
}
/// Read patterns from a file path, one per line.
///
/// If there was a problem reading or if any of the patterns contain invalid
/// UTF-8, then an error is returned. If there was a problem with a specific
/// pattern, then the error message will include the line number and the file
/// path.
pub fn patterns_from_path<P: AsRef<Path>>(path: P) -> io::Result<Vec<String>> {
let path = path.as_ref();
let file = File::open(path).map_err(|err| {
io::Error::new(
io::ErrorKind::Other,
format!("{}: {}", path.display(), err),
)
})?;
patterns_from_reader(file).map_err(|err| {
io::Error::new(
io::ErrorKind::Other,
format!("{}:{}", path.display(), err),
)
})
}
/// Read patterns from stdin, one per line.
///
/// If there was a problem reading or if any of the patterns contain invalid
/// UTF-8, then an error is returned. If there was a problem with a specific
/// pattern, then the error message will include the line number and the fact
/// that it came from stdin.
pub fn patterns_from_stdin() -> io::Result<Vec<String>> {
let stdin = io::stdin();
let locked = stdin.lock();
patterns_from_reader(locked).map_err(|err| {
io::Error::new(
io::ErrorKind::Other,
format!("<stdin>:{}", err),
)
})
}
/// Read patterns from any reader, one per line.
///
/// If there was a problem reading or if any of the patterns contain invalid
/// UTF-8, then an error is returned. If there was a problem with a specific
/// pattern, then the error message will include the line number.
///
/// Note that this routine uses its own internal buffer, so the caller should
/// not provide their own buffered reader if possible.
///
/// # Example
///
/// This shows how to parse patterns, one per line.
///
/// ```
/// use grep_cli::patterns_from_reader;
///
/// # fn example() -> Result<(), Box<::std::error::Error>> {
/// let patterns = "\
/// foo
/// bar\\s+foo
/// [a-z]{3}
/// ";
///
/// assert_eq!(patterns_from_reader(patterns.as_bytes())?, vec![
/// r"foo",
/// r"bar\s+foo",
/// r"[a-z]{3}",
/// ]);
/// # Ok(()) }
/// ```
pub fn patterns_from_reader<R: io::Read>(rdr: R) -> io::Result<Vec<String>> {
let mut patterns = vec![];
let mut line_number = 0;
io::BufReader::new(rdr).for_byte_line(|line| {
line_number += 1;
match pattern_from_bytes(line) {
Ok(pattern) => {
patterns.push(pattern.to_string());
Ok(true)
}
Err(err) => {
Err(io::Error::new(
io::ErrorKind::Other,
format!("{}: {}", line_number, err),
))
}
}
})?;
Ok(patterns)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn bytes() {
let pat = b"abc\xFFxyz";
let err = pattern_from_bytes(pat).unwrap_err();
assert_eq!(3, err.valid_up_to());
}
#[test]
#[cfg(unix)]
fn os() {
use std::os::unix::ffi::OsStrExt;
use std::ffi::OsStr;
let pat = OsStr::from_bytes(b"abc\xFFxyz");
let err = pattern_from_os(pat).unwrap_err();
assert_eq!(3, err.valid_up_to());
}
}

267
grep-cli/src/process.rs Normal file
View File

@@ -0,0 +1,267 @@
use std::error;
use std::fmt;
use std::io::{self, Read};
use std::iter;
use std::process;
use std::thread::{self, JoinHandle};
/// An error that can occur while running a command and reading its output.
///
/// This error can be seamlessly converted to an `io::Error` via a `From`
/// implementation.
#[derive(Debug)]
pub struct CommandError {
kind: CommandErrorKind,
}
#[derive(Debug)]
enum CommandErrorKind {
Io(io::Error),
Stderr(Vec<u8>),
}
impl CommandError {
/// Create an error from an I/O error.
pub(crate) fn io(ioerr: io::Error) -> CommandError {
CommandError { kind: CommandErrorKind::Io(ioerr) }
}
/// Create an error from the contents of stderr (which may be empty).
pub(crate) fn stderr(bytes: Vec<u8>) -> CommandError {
CommandError { kind: CommandErrorKind::Stderr(bytes) }
}
}
impl error::Error for CommandError {
fn description(&self) -> &str { "command error" }
}
impl fmt::Display for CommandError {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
match self.kind {
CommandErrorKind::Io(ref e) => e.fmt(f),
CommandErrorKind::Stderr(ref bytes) => {
let msg = String::from_utf8_lossy(bytes);
if msg.trim().is_empty() {
write!(f, "<stderr is empty>")
} else {
let div = iter::repeat('-').take(79).collect::<String>();
write!(f, "\n{div}\n{msg}\n{div}", div=div, msg=msg.trim())
}
}
}
}
}
impl From<io::Error> for CommandError {
fn from(ioerr: io::Error) -> CommandError {
CommandError { kind: CommandErrorKind::Io(ioerr) }
}
}
impl From<CommandError> for io::Error {
fn from(cmderr: CommandError) -> io::Error {
match cmderr.kind {
CommandErrorKind::Io(ioerr) => ioerr,
CommandErrorKind::Stderr(_) => {
io::Error::new(io::ErrorKind::Other, cmderr)
}
}
}
}
/// Configures and builds a streaming reader for process output.
#[derive(Clone, Debug, Default)]
pub struct CommandReaderBuilder {
async_stderr: bool,
}
impl CommandReaderBuilder {
/// Create a new builder with the default configuration.
pub fn new() -> CommandReaderBuilder {
CommandReaderBuilder::default()
}
/// Build a new streaming reader for the given command's output.
///
/// The caller should set everything that's required on the given command
/// before building a reader, such as its arguments, environment and
/// current working directory. Settings such as the stdout and stderr (but
/// not stdin) pipes will be overridden so that they can be controlled by
/// the reader.
///
/// If there was a problem spawning the given command, then its error is
/// returned.
pub fn build(
&self,
command: &mut process::Command,
) -> Result<CommandReader, CommandError> {
let mut child = command
.stdout(process::Stdio::piped())
.stderr(process::Stdio::piped())
.spawn()?;
let stdout = child.stdout.take().unwrap();
let stderr =
if self.async_stderr {
StderrReader::async(child.stderr.take().unwrap())
} else {
StderrReader::sync(child.stderr.take().unwrap())
};
Ok(CommandReader {
child: child,
stdout: stdout,
stderr: stderr,
done: false,
})
}
/// When enabled, the reader will asynchronously read the contents of the
/// command's stderr output. When disabled, stderr is only read after the
/// stdout stream has been exhausted (or if the process quits with an error
/// code).
///
/// Note that when enabled, this may require launching an additional
/// thread in order to read stderr. This is done so that the process being
/// executed is never blocked from writing to stdout or stderr. If this is
/// disabled, then it is possible for the process to fill up the stderr
/// buffer and deadlock.
///
/// This is enabled by default.
pub fn async_stderr(&mut self, yes: bool) -> &mut CommandReaderBuilder {
self.async_stderr = yes;
self
}
}
/// A streaming reader for a command's output.
///
/// The purpose of this reader is to provide an easy way to execute processes
/// whose stdout is read in a streaming way while also making the processes'
/// stderr available when the process fails with an exit code. This makes it
/// possible to execute processes while surfacing the underlying failure mode
/// in the case of an error.
///
/// Moreover, by default, this reader will asynchronously read the processes'
/// stderr. This prevents subtle deadlocking bugs for noisy processes that
/// write a lot to stderr. Currently, the entire contents of stderr is read
/// on to the heap.
///
/// # Example
///
/// This example shows how to invoke `gzip` to decompress the contents of a
/// file. If the `gzip` command reports a failing exit status, then its stderr
/// is returned as an error.
///
/// ```no_run
/// use std::io::Read;
/// use std::process::Command;
/// use grep_cli::CommandReader;
///
/// # fn example() -> Result<(), Box<::std::error::Error>> {
/// let mut cmd = Command::new("gzip");
/// cmd.arg("-d").arg("-c").arg("/usr/share/man/man1/ls.1.gz");
///
/// let mut rdr = CommandReader::new(&mut cmd)?;
/// let mut contents = vec![];
/// rdr.read_to_end(&mut contents)?;
/// # Ok(()) }
/// ```
#[derive(Debug)]
pub struct CommandReader {
child: process::Child,
stdout: process::ChildStdout,
stderr: StderrReader,
done: bool,
}
impl CommandReader {
/// Create a new streaming reader for the given command using the default
/// configuration.
///
/// The caller should set everything that's required on the given command
/// before building a reader, such as its arguments, environment and
/// current working directory. Settings such as the stdout and stderr (but
/// not stdin) pipes will be overridden so that they can be controlled by
/// the reader.
///
/// If there was a problem spawning the given command, then its error is
/// returned.
///
/// If the caller requires additional configuration for the reader
/// returned, then use
/// [`CommandReaderBuilder`](struct.CommandReaderBuilder.html).
pub fn new(
cmd: &mut process::Command,
) -> Result<CommandReader, CommandError> {
CommandReaderBuilder::new().build(cmd)
}
}
impl io::Read for CommandReader {
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
if self.done {
return Ok(0);
}
let nread = self.stdout.read(buf)?;
if nread == 0 {
self.done = true;
// Reap the child now that we're done reading. If the command
// failed, report stderr as an error.
if !self.child.wait()?.success() {
return Err(io::Error::from(self.stderr.read_to_end()));
}
}
Ok(nread)
}
}
/// A reader that encapsulates the asynchronous or synchronous reading of
/// stderr.
#[derive(Debug)]
enum StderrReader {
Async(Option<JoinHandle<CommandError>>),
Sync(process::ChildStderr),
}
impl StderrReader {
/// Create a reader for stderr that reads contents asynchronously.
fn async(mut stderr: process::ChildStderr) -> StderrReader {
let handle = thread::spawn(move || {
stderr_to_command_error(&mut stderr)
});
StderrReader::Async(Some(handle))
}
/// Create a reader for stderr that reads contents synchronously.
fn sync(stderr: process::ChildStderr) -> StderrReader {
StderrReader::Sync(stderr)
}
/// Consumes all of stderr on to the heap and returns it as an error.
///
/// If there was a problem reading stderr itself, then this returns an I/O
/// command error.
fn read_to_end(&mut self) -> CommandError {
match *self {
StderrReader::Async(ref mut handle) => {
let handle = handle
.take()
.expect("read_to_end cannot be called more than once");
handle
.join()
.expect("stderr reading thread does not panic")
}
StderrReader::Sync(ref mut stderr) => {
stderr_to_command_error(stderr)
}
}
}
}
fn stderr_to_command_error(stderr: &mut process::ChildStderr) -> CommandError {
let mut bytes = vec![];
match stderr.read_to_end(&mut bytes) {
Ok(_) => CommandError::stderr(bytes),
Err(err) => CommandError::io(err),
}
}

133
grep-cli/src/wtr.rs Normal file
View File

@@ -0,0 +1,133 @@
use std::io;
use termcolor;
use is_tty_stdout;
/// A writer that supports coloring with either line or block buffering.
pub struct StandardStream(StandardStreamKind);
/// Returns a possibly buffered writer to stdout for the given color choice.
///
/// The writer returned is either line buffered or block buffered. The decision
/// between these two is made automatically based on whether a tty is attached
/// to stdout or not. If a tty is attached, then line buffering is used.
/// Otherwise, block buffering is used. In general, block buffering is more
/// efficient, but may increase the time it takes for the end user to see the
/// first bits of output.
///
/// If you need more fine grained control over the buffering mode, then use one
/// of `stdout_buffered_line` or `stdout_buffered_block`.
///
/// The color choice given is passed along to the underlying writer. To
/// completely disable colors in all cases, use `ColorChoice::Never`.
pub fn stdout(color_choice: termcolor::ColorChoice) -> StandardStream {
if is_tty_stdout() {
stdout_buffered_line(color_choice)
} else {
stdout_buffered_block(color_choice)
}
}
/// Returns a line buffered writer to stdout for the given color choice.
///
/// This writer is useful when printing results directly to a tty such that
/// users see output as soon as it's written. The downside of this approach
/// is that it can be slower, especially when there is a lot of output.
///
/// You might consider using
/// [`stdout`](fn.stdout.html)
/// instead, which chooses the buffering strategy automatically based on
/// whether stdout is connected to a tty.
pub fn stdout_buffered_line(
color_choice: termcolor::ColorChoice,
) -> StandardStream {
let out = termcolor::StandardStream::stdout(color_choice);
StandardStream(StandardStreamKind::LineBuffered(out))
}
/// Returns a block buffered writer to stdout for the given color choice.
///
/// This writer is useful when printing results to a file since it amortizes
/// the cost of writing data. The downside of this approach is that it can
/// increase the latency of display output when writing to a tty.
///
/// You might consider using
/// [`stdout`](fn.stdout.html)
/// instead, which chooses the buffering strategy automatically based on
/// whether stdout is connected to a tty.
pub fn stdout_buffered_block(
color_choice: termcolor::ColorChoice,
) -> StandardStream {
let out = termcolor::BufferedStandardStream::stdout(color_choice);
StandardStream(StandardStreamKind::BlockBuffered(out))
}
enum StandardStreamKind {
LineBuffered(termcolor::StandardStream),
BlockBuffered(termcolor::BufferedStandardStream),
}
impl io::Write for StandardStream {
#[inline]
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
use self::StandardStreamKind::*;
match self.0 {
LineBuffered(ref mut w) => w.write(buf),
BlockBuffered(ref mut w) => w.write(buf),
}
}
#[inline]
fn flush(&mut self) -> io::Result<()> {
use self::StandardStreamKind::*;
match self.0 {
LineBuffered(ref mut w) => w.flush(),
BlockBuffered(ref mut w) => w.flush(),
}
}
}
impl termcolor::WriteColor for StandardStream {
#[inline]
fn supports_color(&self) -> bool {
use self::StandardStreamKind::*;
match self.0 {
LineBuffered(ref w) => w.supports_color(),
BlockBuffered(ref w) => w.supports_color(),
}
}
#[inline]
fn set_color(&mut self, spec: &termcolor::ColorSpec) -> io::Result<()> {
use self::StandardStreamKind::*;
match self.0 {
LineBuffered(ref mut w) => w.set_color(spec),
BlockBuffered(ref mut w) => w.set_color(spec),
}
}
#[inline]
fn reset(&mut self) -> io::Result<()> {
use self::StandardStreamKind::*;
match self.0 {
LineBuffered(ref mut w) => w.reset(),
BlockBuffered(ref mut w) => w.reset(),
}
}
#[inline]
fn is_synchronous(&self) -> bool {
use self::StandardStreamKind::*;
match self.0 {
LineBuffered(ref w) => w.is_synchronous(),
BlockBuffered(ref w) => w.is_synchronous(),
}
}
}

View File

@@ -1,6 +1,6 @@
[package]
name = "grep-matcher"
version = "0.1.0" #:version
version = "0.1.2" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
A trait for regular expressions, with a focus on line oriented search.
@@ -14,10 +14,10 @@ license = "Unlicense/MIT"
autotests = false
[dependencies]
memchr = "2"
memchr = "2.1"
[dev-dependencies]
regex = "1"
regex = "1.1"
[[test]]
name = "integration"

View File

@@ -134,7 +134,7 @@ fn find_cap_ref(replacement: &[u8]) -> Option<CaptureRef> {
/// Returns true if and only if the given byte is allowed in a capture name.
fn is_valid_cap_letter(b: &u8) -> bool {
match *b {
b'0' ... b'9' | b'a' ... b'z' | b'A' ... b'Z' | b'_' => true,
b'0' ..= b'9' | b'a' ..= b'z' | b'A' ..= b'Z' | b'_' => true,
_ => false,
}
}

View File

@@ -266,6 +266,16 @@ impl LineTerminator {
LineTerminatorImp::CRLF => &[b'\r', b'\n'],
}
}
/// Returns true if and only if the given slice ends with this line
/// terminator.
///
/// If this line terminator is `CRLF`, then this only checks whether the
/// last byte is `\n`.
#[inline]
pub fn is_suffix(&self, slice: &[u8]) -> bool {
slice.last().map_or(false, |&b| b == self.as_byte())
}
}
impl Default for LineTerminator {

View File

@@ -1,6 +1,6 @@
[package]
name = "grep-pcre2"
version = "0.1.0" #:version
version = "0.1.3" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
Use PCRE2 with the 'grep' crate.
@@ -13,5 +13,5 @@ keywords = ["regex", "grep", "pcre", "backreference", "look"]
license = "Unlicense/MIT"
[dependencies]
grep-matcher = { version = "0.1.0", path = "../grep-matcher" }
pcre2 = "0.1"
grep-matcher = { version = "0.1.2", path = "../grep-matcher" }
pcre2 = "0.2.0"

View File

@@ -10,6 +10,7 @@ extern crate pcre2;
pub use error::{Error, ErrorKind};
pub use matcher::{RegexCaptures, RegexMatcher, RegexMatcherBuilder};
pub use pcre2::{is_jit_available, version};
mod error;
mod matcher;

View File

@@ -199,16 +199,55 @@ impl RegexMatcherBuilder {
self
}
/// Enable PCRE2's JIT.
/// Enable PCRE2's JIT and return an error if it's not available.
///
/// This generally speeds up matching quite a bit. The downside is that it
/// can increase the time it takes to compile a pattern.
///
/// This is disabled by default.
/// If the JIT isn't available or if JIT compilation returns an error, then
/// regex compilation will fail with the corresponding error.
///
/// This is disabled by default, and always overrides `jit_if_available`.
pub fn jit(&mut self, yes: bool) -> &mut RegexMatcherBuilder {
self.builder.jit(yes);
self
}
/// Enable PCRE2's JIT if it's available.
///
/// This generally speeds up matching quite a bit. The downside is that it
/// can increase the time it takes to compile a pattern.
///
/// If the JIT isn't available or if JIT compilation returns an error,
/// then a debug message with the error will be emitted and the regex will
/// otherwise silently fall back to non-JIT matching.
///
/// This is disabled by default, and always overrides `jit`.
pub fn jit_if_available(&mut self, yes: bool) -> &mut RegexMatcherBuilder {
self.builder.jit_if_available(yes);
self
}
/// Set the maximum size of PCRE2's JIT stack, in bytes. If the JIT is
/// not enabled, then this has no effect.
///
/// When `None` is given, no custom JIT stack will be created, and instead,
/// the default JIT stack is used. When the default is used, its maximum
/// size is 32 KB.
///
/// When this is set, then a new JIT stack will be created with the given
/// maximum size as its limit.
///
/// Increasing the stack size can be useful for larger regular expressions.
///
/// By default, this is set to `None`.
pub fn max_jit_stack_size(
&mut self,
bytes: Option<usize>,
) -> &mut RegexMatcherBuilder {
self.builder.max_jit_stack_size(bytes);
self
}
}
/// An implementation of the `Matcher` trait using PCRE2.

View File

@@ -1,6 +1,6 @@
[package]
name = "grep-printer"
version = "0.1.0" #:version
version = "0.1.3" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
An implementation of the grep crate's Sink trait that provides standard
@@ -18,13 +18,14 @@ default = ["serde1"]
serde1 = ["base64", "serde", "serde_derive", "serde_json"]
[dependencies]
base64 = { version = "0.9", optional = true }
grep-matcher = { version = "0.1.0", path = "../grep-matcher" }
grep-searcher = { version = "0.1.0", path = "../grep-searcher" }
termcolor = "1"
serde = { version = "1", optional = true }
serde_derive = { version = "1", optional = true }
serde_json = { version = "1", optional = true }
base64 = { version = "0.10.0", optional = true }
bstr = "0.2.0"
grep-matcher = { version = "0.1.2", path = "../grep-matcher" }
grep-searcher = { version = "0.1.4", path = "../grep-searcher" }
termcolor = "1.0.4"
serde = { version = "1.0.77", optional = true }
serde_derive = { version = "1.0.77", optional = true }
serde_json = { version = "1.0.27", optional = true }
[dev-dependencies]
grep-regex = { version = "0.1.0", path = "../grep-regex" }
grep-regex = { version = "0.1.3", path = "../grep-regex" }

View File

@@ -4,6 +4,25 @@ use std::str::FromStr;
use termcolor::{Color, ColorSpec, ParseColorError};
/// Returns a default set of color specifications.
///
/// This may change over time, but the color choices are meant to be fairly
/// conservative that work across terminal themes.
///
/// Additional color specifications can be added to the list returned. More
/// recently added specifications override previously added specifications.
pub fn default_color_specs() -> Vec<UserColorSpec> {
vec![
#[cfg(unix)]
"path:fg:magenta".parse().unwrap(),
#[cfg(windows)]
"path:fg:cyan".parse().unwrap(),
"line:fg:green".parse().unwrap(),
"match:fg:red".parse().unwrap(),
"match:style:bold".parse().unwrap(),
]
}
/// An error that can occur when parsing color specifications.
#[derive(Clone, Debug, Eq, PartialEq)]
pub enum ColorError {
@@ -227,6 +246,15 @@ impl ColorSpecs {
merged
}
/// Create a default set of specifications that have color.
///
/// This is distinct from `ColorSpecs`'s `Default` implementation in that
/// this provides a set of default color choices, where as the `Default`
/// implementation provides no color choices.
pub fn default_with_color() -> ColorSpecs {
ColorSpecs::new(&default_color_specs())
}
/// Return the color specification for coloring file paths.
pub fn path(&self) -> &ColorSpec {
&self.path

View File

@@ -817,7 +817,8 @@ impl<'a> SubMatches<'a> {
#[cfg(test)]
mod tests {
use grep_regex::RegexMatcher;
use grep_regex::{RegexMatcher, RegexMatcherBuilder};
use grep_matcher::LineTerminator;
use grep_searcher::SearcherBuilder;
use super::{JSON, JSONBuilder};
@@ -918,4 +919,45 @@ and exhibited clearly, with a label attached.\
assert_eq!(got.lines().count(), 2);
assert!(got.contains("begin") && got.contains("end"));
}
#[test]
fn missing_crlf() {
let haystack = "test\r\n".as_bytes();
let matcher = RegexMatcherBuilder::new()
.build("test")
.unwrap();
let mut printer = JSONBuilder::new()
.build(vec![]);
SearcherBuilder::new()
.build()
.search_reader(&matcher, haystack, printer.sink(&matcher))
.unwrap();
let got = printer_contents(&mut printer);
assert_eq!(got.lines().count(), 3);
assert!(
got.lines().nth(1).unwrap().contains(r"test\r\n"),
r"missing 'test\r\n' in '{}'",
got.lines().nth(1).unwrap(),
);
let matcher = RegexMatcherBuilder::new()
.crlf(true)
.build("test")
.unwrap();
let mut printer = JSONBuilder::new()
.build(vec![]);
SearcherBuilder::new()
.line_terminator(LineTerminator::crlf())
.build()
.search_reader(&matcher, haystack, printer.sink(&matcher))
.unwrap();
let got = printer_contents(&mut printer);
assert_eq!(got.lines().count(), 3);
assert!(
got.lines().nth(1).unwrap().contains(r"test\r\n"),
r"missing 'test\r\n' in '{}'",
got.lines().nth(1).unwrap(),
);
}
}

View File

@@ -114,39 +114,6 @@ impl<'a> Data<'a> {
// so we do the easy thing for now.
Data::Text { text: path.to_string_lossy() }
}
// Unused deserialization routines.
/*
fn into_bytes(self) -> Vec<u8> {
match self {
Data::Text { text } => text.into_bytes(),
Data::Bytes { bytes } => bytes,
}
}
#[cfg(unix)]
fn into_path_buf(&self) -> PathBuf {
use std::os::unix::ffi::OsStrExt;
match self {
Data::Text { text } => PathBuf::from(text),
Data::Bytes { bytes } => {
PathBuf::from(OsStr::from_bytes(bytes))
}
}
}
#[cfg(not(unix))]
fn into_path_buf(&self) -> PathBuf {
match self {
Data::Text { text } => PathBuf::from(text),
Data::Bytes { bytes } => {
PathBuf::from(String::from_utf8_lossy(&bytes).into_owned())
}
}
}
*/
}
fn to_base64<T, S>(
@@ -178,36 +145,3 @@ where P: AsRef<Path>,
{
path.as_ref().map(|p| Data::from_path(p.as_ref())).serialize(ser)
}
// The following are some deserialization helpers, in case we decide to support
// deserialization of the above types.
/*
fn from_base64<'de, D>(
de: D,
) -> Result<Vec<u8>, D::Error>
where D: Deserializer<'de>
{
let encoded = String::deserialize(de)?;
let decoded = base64::decode(encoded.as_bytes())
.map_err(D::Error::custom)?;
Ok(decoded)
}
fn deser_bytes<'de, D>(
de: D,
) -> Result<Vec<u8>, D::Error>
where D: Deserializer<'de>
{
Data::deserialize(de).map(|datum| datum.into_bytes())
}
fn deser_path<'de, D>(
de: D,
) -> Result<Option<PathBuf>, D::Error>
where D: Deserializer<'de>
{
Option::<Data>::deserialize(de)
.map(|opt| opt.map(|datum| datum.into_path_buf()))
}
*/

View File

@@ -70,6 +70,7 @@ fn example() -> Result<(), Box<Error>> {
#[cfg(feature = "serde1")]
extern crate base64;
extern crate bstr;
extern crate grep_matcher;
#[cfg(test)]
extern crate grep_regex;
@@ -83,7 +84,7 @@ extern crate serde_derive;
extern crate serde_json;
extern crate termcolor;
pub use color::{ColorError, ColorSpecs, UserColorSpec};
pub use color::{ColorError, ColorSpecs, UserColorSpec, default_color_specs};
#[cfg(feature = "serde1")]
pub use json::{JSON, JSONBuilder, JSONSink};
pub use standard::{Standard, StandardBuilder, StandardSink};

View File

@@ -1,3 +1,4 @@
/// Like assert_eq, but nicer output for long strings.
#[cfg(test)]
#[macro_export]
macro_rules! assert_eq_printed {

View File

@@ -5,6 +5,7 @@ use std::path::Path;
use std::sync::Arc;
use std::time::Instant;
use bstr::ByteSlice;
use grep_matcher::{Match, Matcher};
use grep_searcher::{
LineStep, Searcher,
@@ -16,10 +17,7 @@ use termcolor::{ColorSpec, NoColor, WriteColor};
use color::ColorSpecs;
use counter::CounterWriter;
use stats::Stats;
use util::{
PrinterPath, Replacer, Sunk,
trim_ascii_prefix, trim_ascii_prefix_range,
};
use util::{PrinterPath, Replacer, Sunk, trim_ascii_prefix};
/// The configuration for the standard printer.
///
@@ -36,6 +34,7 @@ struct Config {
per_match: bool,
replacement: Arc<Option<Vec<u8>>>,
max_columns: Option<u64>,
max_columns_preview: bool,
max_matches: Option<u64>,
column: bool,
byte_offset: bool,
@@ -59,6 +58,7 @@ impl Default for Config {
per_match: false,
replacement: Arc::new(None),
max_columns: None,
max_columns_preview: false,
max_matches: None,
column: false,
byte_offset: false,
@@ -239,8 +239,9 @@ impl StandardBuilder {
/// which may either be in index form (e.g., `$2`) or can reference named
/// capturing groups if present in the original pattern (e.g., `$foo`).
///
/// For documentation on the full format, please see the `Matcher` trait's
/// `interpolate` method.
/// For documentation on the full format, please see the `Capture` trait's
/// `interpolate` method in the
/// [grep-printer](https://docs.rs/grep-printer) crate.
pub fn replacement(
&mut self,
replacement: Option<Vec<u8>>,
@@ -262,6 +263,21 @@ impl StandardBuilder {
self
}
/// When enabled, if a line is found to be over the configured maximum
/// column limit (measured in terms of bytes), then a preview of the long
/// line will be printed instead.
///
/// The preview will correspond to the first `N` *grapheme clusters* of
/// the line, where `N` is the limit configured by `max_columns`.
///
/// If no limit is set, then enabling this has no effect.
///
/// This is disabled by default.
pub fn max_columns_preview(&mut self, yes: bool) -> &mut StandardBuilder {
self.config.max_columns_preview = yes;
self
}
/// Set the maximum amount of matching lines that are printed.
///
/// If multi line search is enabled and a match spans multiple lines, then
@@ -742,6 +758,11 @@ impl<'p, 's, M: Matcher, W: WriteColor> Sink for StandardSink<'p, 's, M, W> {
stats.add_matches(self.standard.matches.len() as u64);
stats.add_matched_lines(mat.lines().count() as u64);
}
if searcher.binary_detection().convert_byte().is_some() {
if self.binary_byte_offset.is_some() {
return Ok(false);
}
}
StandardImpl::from_match(searcher, self, mat).sink()?;
Ok(!self.should_quit())
@@ -763,6 +784,12 @@ impl<'p, 's, M: Matcher, W: WriteColor> Sink for StandardSink<'p, 's, M, W> {
self.record_matches(ctx.bytes())?;
self.replace(ctx.bytes())?;
}
if searcher.binary_detection().convert_byte().is_some() {
if self.binary_byte_offset.is_some() {
return Ok(false);
}
}
StandardImpl::from_context(searcher, self, ctx).sink()?;
Ok(!self.should_quit())
}
@@ -775,6 +802,15 @@ impl<'p, 's, M: Matcher, W: WriteColor> Sink for StandardSink<'p, 's, M, W> {
Ok(true)
}
fn binary_data(
&mut self,
_searcher: &Searcher,
binary_byte_offset: u64,
) -> Result<bool, io::Error> {
self.binary_byte_offset = Some(binary_byte_offset);
Ok(true)
}
fn begin(
&mut self,
_searcher: &Searcher,
@@ -792,10 +828,12 @@ impl<'p, 's, M: Matcher, W: WriteColor> Sink for StandardSink<'p, 's, M, W> {
fn finish(
&mut self,
_searcher: &Searcher,
searcher: &Searcher,
finish: &SinkFinish,
) -> Result<(), io::Error> {
self.binary_byte_offset = finish.binary_byte_offset();
if let Some(offset) = self.binary_byte_offset {
StandardImpl::new(searcher, self).write_binary_message(offset)?;
}
if let Some(stats) = self.stats.as_mut() {
stats.add_elapsed(self.start_time.elapsed());
stats.add_searches(1);
@@ -991,7 +1029,7 @@ impl<'a, M: Matcher, W: WriteColor> StandardImpl<'a, M, W> {
let mut count = 0;
let mut stepper = LineStep::new(line_term, 0, bytes.len());
while let Some((start, end)) = stepper.next(bytes) {
let mut line = Match::new(start, end);
let line = Match::new(start, end);
self.write_prelude(
self.sunk.absolute_byte_offset() + line.start() as u64,
self.sunk.line_number().map(|n| n + count),
@@ -999,43 +1037,11 @@ impl<'a, M: Matcher, W: WriteColor> StandardImpl<'a, M, W> {
)?;
count += 1;
if self.exceeds_max_columns(&bytes[line]) {
self.write_exceeded_line()?;
continue;
self.write_exceeded_line(bytes, line, matches, &mut midx)?;
} else {
self.write_colored_matches(bytes, line, matches, &mut midx)?;
self.write_line_term()?;
}
if self.has_line_terminator(&bytes[line]) {
line = line.with_end(line.end() - 1);
}
if self.config().trim_ascii {
line = self.trim_ascii_prefix_range(bytes, line);
}
while !line.is_empty() {
if matches[midx].end() <= line.start() {
if midx + 1 < matches.len() {
midx += 1;
continue;
} else {
self.end_color_match()?;
self.write(&bytes[line])?;
break;
}
}
let m = matches[midx];
if line.start() < m.start() {
let upto = cmp::min(line.end(), m.start());
self.end_color_match()?;
self.write(&bytes[line.with_end(upto)])?;
line = line.with_start(upto);
} else {
let upto = cmp::min(line.end(), m.end());
self.start_color_match()?;
self.write(&bytes[line.with_end(upto)])?;
line = line.with_start(upto);
}
}
self.end_color_match()?;
self.write_line_term()?;
}
Ok(())
}
@@ -1050,12 +1056,8 @@ impl<'a, M: Matcher, W: WriteColor> StandardImpl<'a, M, W> {
let mut stepper = LineStep::new(line_term, 0, bytes.len());
while let Some((start, end)) = stepper.next(bytes) {
let mut line = Match::new(start, end);
if self.has_line_terminator(&bytes[line]) {
line = line.with_end(line.end() - 1);
}
if self.config().trim_ascii {
line = self.trim_ascii_prefix_range(bytes, line);
}
self.trim_line_terminator(bytes, &mut line);
self.trim_ascii_prefix(bytes, &mut line);
while !line.is_empty() {
if matches[midx].end() <= line.start() {
if midx + 1 < matches.len() {
@@ -1078,14 +1080,19 @@ impl<'a, M: Matcher, W: WriteColor> StandardImpl<'a, M, W> {
Some(m.start() as u64 + 1),
)?;
let buf = &bytes[line.with_end(upto)];
let this_line = line.with_end(upto);
line = line.with_start(upto);
if self.exceeds_max_columns(&buf) {
self.write_exceeded_line()?;
continue;
if self.exceeds_max_columns(&bytes[this_line]) {
self.write_exceeded_line(
bytes,
this_line,
matches,
&mut midx,
)?;
} else {
self.write_spec(spec, &bytes[this_line])?;
self.write_line_term()?;
}
self.write_spec(spec, buf)?;
self.write_line_term()?;
}
}
count += 1;
@@ -1098,7 +1105,6 @@ impl<'a, M: Matcher, W: WriteColor> StandardImpl<'a, M, W> {
let spec = self.config().colors.matched();
let bytes = self.sunk.bytes();
for &m in self.sunk.matches() {
let mut m = m;
let mut count = 0;
let mut stepper = LineStep::new(line_term, 0, bytes.len());
while let Some((start, end)) = stepper.next(bytes) {
@@ -1116,15 +1122,11 @@ impl<'a, M: Matcher, W: WriteColor> StandardImpl<'a, M, W> {
)?;
count += 1;
if self.exceeds_max_columns(&bytes[line]) {
self.write_exceeded_line()?;
self.write_exceeded_line(bytes, line, &[m], &mut 0)?;
continue;
}
if self.has_line_terminator(&bytes[line]) {
line = line.with_end(line.end() - 1);
}
if self.config().trim_ascii {
line = self.trim_ascii_prefix_range(bytes, line);
}
self.trim_line_terminator(bytes, &mut line);
self.trim_ascii_prefix(bytes, &mut line);
while !line.is_empty() {
if m.end() <= line.start() {
@@ -1181,7 +1183,10 @@ impl<'a, M: Matcher, W: WriteColor> StandardImpl<'a, M, W> {
line: &[u8],
) -> io::Result<()> {
if self.exceeds_max_columns(line) {
self.write_exceeded_line()?;
let range = Match::new(0, line.len());
self.write_exceeded_line(
line, range, self.sunk.matches(), &mut 0,
)?;
} else {
self.write_trim(line)?;
if !self.has_line_terminator(line) {
@@ -1194,47 +1199,114 @@ impl<'a, M: Matcher, W: WriteColor> StandardImpl<'a, M, W> {
fn write_colored_line(
&self,
matches: &[Match],
line: &[u8],
bytes: &[u8],
) -> io::Result<()> {
// If we know we aren't going to emit color, then we can go faster.
let spec = self.config().colors.matched();
if !self.wtr().borrow().supports_color() || spec.is_none() {
return self.write_line(line);
return self.write_line(bytes);
}
let mut last_written =
if !self.config().trim_ascii {
0
} else {
self.trim_ascii_prefix_range(
line,
Match::new(0, line.len()),
).start()
};
for mut m in matches.iter().map(|&m| m) {
if last_written < m.start() {
let line = Match::new(0, bytes.len());
if self.exceeds_max_columns(bytes) {
self.write_exceeded_line(bytes, line, matches, &mut 0)
} else {
self.write_colored_matches(bytes, line, matches, &mut 0)?;
self.write_line_term()?;
Ok(())
}
}
/// Write the `line` portion of `bytes`, with appropriate coloring for
/// each `match`, starting at `match_index`.
///
/// This accounts for trimming any whitespace prefix and will *never* print
/// a line terminator. If a match exceeds the range specified by `line`,
/// then only the part of the match within `line` (if any) is printed.
fn write_colored_matches(
&self,
bytes: &[u8],
mut line: Match,
matches: &[Match],
match_index: &mut usize,
) -> io::Result<()> {
self.trim_line_terminator(bytes, &mut line);
self.trim_ascii_prefix(bytes, &mut line);
if matches.is_empty() {
self.write(&bytes[line])?;
return Ok(());
}
while !line.is_empty() {
if matches[*match_index].end() <= line.start() {
if *match_index + 1 < matches.len() {
*match_index += 1;
continue;
} else {
self.end_color_match()?;
self.write(&bytes[line])?;
break;
}
}
let m = matches[*match_index];
if line.start() < m.start() {
let upto = cmp::min(line.end(), m.start());
self.end_color_match()?;
self.write(&line[last_written..m.start()])?;
} else if last_written < m.end() {
m = m.with_start(last_written);
self.write(&bytes[line.with_end(upto)])?;
line = line.with_start(upto);
} else {
continue;
}
if !m.is_empty() {
let upto = cmp::min(line.end(), m.end());
self.start_color_match()?;
self.write(&line[m])?;
self.write(&bytes[line.with_end(upto)])?;
line = line.with_start(upto);
}
last_written = m.end();
}
self.end_color_match()?;
self.write(&line[last_written..])?;
if !self.has_line_terminator(line) {
self.write_line_term()?;
}
Ok(())
}
fn write_exceeded_line(&self) -> io::Result<()> {
fn write_exceeded_line(
&self,
bytes: &[u8],
mut line: Match,
matches: &[Match],
match_index: &mut usize,
) -> io::Result<()> {
if self.config().max_columns_preview {
let original = line;
let end = bytes[line]
.grapheme_indices()
.map(|(_, end, _)| end)
.take(self.config().max_columns.unwrap_or(0) as usize)
.last()
.unwrap_or(0) + line.start();
line = line.with_end(end);
self.write_colored_matches(bytes, line, matches, match_index)?;
if matches.is_empty() {
self.write(b" [... omitted end of long line]")?;
} else {
let remaining = matches
.iter()
.filter(|m| {
m.start() >= line.end() && m.start() < original.end()
})
.count();
let tense =
if remaining == 1 {
"match"
} else {
"matches"
};
write!(
self.wtr().borrow_mut(),
" [... {} more {}]",
remaining, tense,
)?;
}
self.write_line_term()?;
return Ok(());
}
if self.sunk.original_matches().is_empty() {
if self.is_context() {
self.write(b"[Omitted long context line]")?;
@@ -1310,6 +1382,38 @@ impl<'a, M: Matcher, W: WriteColor> StandardImpl<'a, M, W> {
Ok(())
}
fn write_binary_message(&self, offset: u64) -> io::Result<()> {
if self.sink.match_count == 0 {
return Ok(());
}
let bin = self.searcher.binary_detection();
if let Some(byte) = bin.quit_byte() {
self.write(b"WARNING: stopped searching binary file ")?;
if let Some(path) = self.path() {
self.write_spec(self.config().colors.path(), path.as_bytes())?;
self.write(b" ")?;
}
let remainder = format!(
"after match (found {:?} byte around offset {})\n",
[byte].as_bstr(), offset,
);
self.write(remainder.as_bytes())?;
} else if let Some(byte) = bin.convert_byte() {
self.write(b"Binary file ")?;
if let Some(path) = self.path() {
self.write_spec(self.config().colors.path(), path.as_bytes())?;
self.write(b" ")?;
}
let remainder = format!(
"matches (found {:?} byte around offset {})\n",
[byte].as_bstr(), offset,
);
self.write(remainder.as_bytes())?;
}
Ok(())
}
fn write_context_separator(&self) -> io::Result<()> {
if let Some(ref sep) = *self.config().separator_context {
self.write(sep)?;
@@ -1385,15 +1489,28 @@ impl<'a, M: Matcher, W: WriteColor> StandardImpl<'a, M, W> {
if !self.config().trim_ascii {
return self.write(buf);
}
self.write(self.trim_ascii_prefix(buf))
let mut range = Match::new(0, buf.len());
self.trim_ascii_prefix(buf, &mut range);
self.write(&buf[range])
}
fn write(&self, buf: &[u8]) -> io::Result<()> {
self.wtr().borrow_mut().write_all(buf)
}
fn trim_line_terminator(&self, buf: &[u8], line: &mut Match) {
let lineterm = self.searcher.line_terminator();
if lineterm.is_suffix(&buf[*line]) {
let mut end = line.end() - 1;
if lineterm.is_crlf() && buf[end - 1] == b'\r' {
end -= 1;
}
*line = line.with_end(end);
}
}
fn has_line_terminator(&self, buf: &[u8]) -> bool {
buf.last() == Some(&self.searcher.line_terminator().as_byte())
self.searcher.line_terminator().is_suffix(buf)
}
fn is_context(&self) -> bool {
@@ -1447,14 +1564,12 @@ impl<'a, M: Matcher, W: WriteColor> StandardImpl<'a, M, W> {
///
/// This stops trimming a prefix as soon as it sees non-whitespace or a
/// line terminator.
fn trim_ascii_prefix_range(&self, slice: &[u8], range: Match) -> Match {
trim_ascii_prefix_range(self.searcher.line_terminator(), slice, range)
}
/// Trim prefix ASCII spaces from the given slice and return the
/// corresponding sub-slice.
fn trim_ascii_prefix<'s>(&self, slice: &'s [u8]) -> &'s [u8] {
trim_ascii_prefix(self.searcher.line_terminator(), slice)
fn trim_ascii_prefix(&self, slice: &[u8], range: &mut Match) {
if !self.config().trim_ascii {
return;
}
let lineterm = self.searcher.line_terminator();
*range = trim_ascii_prefix(lineterm, slice, *range)
}
}
@@ -2221,6 +2336,31 @@ but Doctor Watson has to have it taken out for him and dusted,
assert_eq_printed!(expected, got);
}
#[test]
fn max_columns_preview() {
let matcher = RegexMatcher::new("exhibited|dusted").unwrap();
let mut printer = StandardBuilder::new()
.max_columns(Some(46))
.max_columns_preview(true)
.build(NoColor::new(vec![]));
SearcherBuilder::new()
.line_number(false)
.build()
.search_reader(
&matcher,
SHERLOCK.as_bytes(),
printer.sink(&matcher),
)
.unwrap();
let got = printer_contents(&mut printer);
let expected = "\
but Doctor Watson has to have it taken out for [... omitted end of long line]
and exhibited clearly, with a label attached.
";
assert_eq_printed!(expected, got);
}
#[test]
fn max_columns_with_count() {
let matcher = RegexMatcher::new("cigar|ash|dusted").unwrap();
@@ -2246,6 +2386,86 @@ but Doctor Watson has to have it taken out for him and dusted,
assert_eq_printed!(expected, got);
}
#[test]
fn max_columns_with_count_preview_no_match() {
let matcher = RegexMatcher::new("exhibited|has to have it").unwrap();
let mut printer = StandardBuilder::new()
.stats(true)
.max_columns(Some(46))
.max_columns_preview(true)
.build(NoColor::new(vec![]));
SearcherBuilder::new()
.line_number(false)
.build()
.search_reader(
&matcher,
SHERLOCK.as_bytes(),
printer.sink(&matcher),
)
.unwrap();
let got = printer_contents(&mut printer);
let expected = "\
but Doctor Watson has to have it taken out for [... 0 more matches]
and exhibited clearly, with a label attached.
";
assert_eq_printed!(expected, got);
}
#[test]
fn max_columns_with_count_preview_one_match() {
let matcher = RegexMatcher::new("exhibited|dusted").unwrap();
let mut printer = StandardBuilder::new()
.stats(true)
.max_columns(Some(46))
.max_columns_preview(true)
.build(NoColor::new(vec![]));
SearcherBuilder::new()
.line_number(false)
.build()
.search_reader(
&matcher,
SHERLOCK.as_bytes(),
printer.sink(&matcher),
)
.unwrap();
let got = printer_contents(&mut printer);
let expected = "\
but Doctor Watson has to have it taken out for [... 1 more match]
and exhibited clearly, with a label attached.
";
assert_eq_printed!(expected, got);
}
#[test]
fn max_columns_with_count_preview_two_matches() {
let matcher = RegexMatcher::new(
"exhibited|dusted|has to have it",
).unwrap();
let mut printer = StandardBuilder::new()
.stats(true)
.max_columns(Some(46))
.max_columns_preview(true)
.build(NoColor::new(vec![]));
SearcherBuilder::new()
.line_number(false)
.build()
.search_reader(
&matcher,
SHERLOCK.as_bytes(),
printer.sink(&matcher),
)
.unwrap();
let got = printer_contents(&mut printer);
let expected = "\
but Doctor Watson has to have it taken out for [... 1 more match]
and exhibited clearly, with a label attached.
";
assert_eq_printed!(expected, got);
}
#[test]
fn max_columns_multi_line() {
let matcher = RegexMatcher::new("(?s)ash.+dusted").unwrap();
@@ -2271,6 +2491,36 @@ but Doctor Watson has to have it taken out for him and dusted,
assert_eq_printed!(expected, got);
}
#[test]
fn max_columns_multi_line_preview() {
let matcher = RegexMatcher::new(
"(?s)clew|cigar ash.+have it|exhibited",
).unwrap();
let mut printer = StandardBuilder::new()
.stats(true)
.max_columns(Some(46))
.max_columns_preview(true)
.build(NoColor::new(vec![]));
SearcherBuilder::new()
.line_number(false)
.multi_line(true)
.build()
.search_reader(
&matcher,
SHERLOCK.as_bytes(),
printer.sink(&matcher),
)
.unwrap();
let got = printer_contents(&mut printer);
let expected = "\
can extract a clew from a wisp of straw or a f [... 1 more match]
but Doctor Watson has to have it taken out for [... 0 more matches]
and exhibited clearly, with a label attached.
";
assert_eq_printed!(expected, got);
}
#[test]
fn max_matches() {
let matcher = RegexMatcher::new("Sherlock").unwrap();
@@ -2560,8 +2810,40 @@ Holmeses, success in the province of detective work must always
assert_eq_printed!(expected, got);
}
#[test]
fn only_matching_max_columns_preview() {
let matcher = RegexMatcher::new("Doctor Watsons|Sherlock").unwrap();
let mut printer = StandardBuilder::new()
.only_matching(true)
.max_columns(Some(10))
.max_columns_preview(true)
.column(true)
.build(NoColor::new(vec![]));
SearcherBuilder::new()
.line_number(true)
.build()
.search_reader(
&matcher,
SHERLOCK.as_bytes(),
printer.sink(&matcher),
)
.unwrap();
let got = printer_contents(&mut printer);
let expected = "\
1:9:Doctor Wat [... 0 more matches]
1:57:Sherlock
3:49:Sherlock
";
assert_eq_printed!(expected, got);
}
#[test]
fn only_matching_max_columns_multi_line1() {
// The `(?s:.{0})` trick fools the matcher into thinking that it
// can match across multiple lines without actually doing so. This is
// so we can test multi-line handling in the case of a match on only
// one line.
let matcher = RegexMatcher::new(
r"(?s:.{0})(Doctor Watsons|Sherlock)"
).unwrap();
@@ -2590,6 +2872,41 @@ Holmeses, success in the province of detective work must always
assert_eq_printed!(expected, got);
}
#[test]
fn only_matching_max_columns_preview_multi_line1() {
// The `(?s:.{0})` trick fools the matcher into thinking that it
// can match across multiple lines without actually doing so. This is
// so we can test multi-line handling in the case of a match on only
// one line.
let matcher = RegexMatcher::new(
r"(?s:.{0})(Doctor Watsons|Sherlock)"
).unwrap();
let mut printer = StandardBuilder::new()
.only_matching(true)
.max_columns(Some(10))
.max_columns_preview(true)
.column(true)
.build(NoColor::new(vec![]));
SearcherBuilder::new()
.multi_line(true)
.line_number(true)
.build()
.search_reader(
&matcher,
SHERLOCK.as_bytes(),
printer.sink(&matcher),
)
.unwrap();
let got = printer_contents(&mut printer);
let expected = "\
1:9:Doctor Wat [... 0 more matches]
1:57:Sherlock
3:49:Sherlock
";
assert_eq_printed!(expected, got);
}
#[test]
fn only_matching_max_columns_multi_line2() {
let matcher = RegexMatcher::new(
@@ -2621,6 +2938,38 @@ Holmeses, success in the province of detective work must always
assert_eq_printed!(expected, got);
}
#[test]
fn only_matching_max_columns_preview_multi_line2() {
let matcher = RegexMatcher::new(
r"(?s)Watson.+?(Holmeses|clearly)"
).unwrap();
let mut printer = StandardBuilder::new()
.only_matching(true)
.max_columns(Some(50))
.max_columns_preview(true)
.column(true)
.build(NoColor::new(vec![]));
SearcherBuilder::new()
.multi_line(true)
.line_number(true)
.build()
.search_reader(
&matcher,
SHERLOCK.as_bytes(),
printer.sink(&matcher),
)
.unwrap();
let got = printer_contents(&mut printer);
let expected = "\
1:16:Watsons of this world, as opposed to the Sherlock
2:16:Holmeses
5:12:Watson has to have it taken out for him and dusted [... 0 more matches]
6:12:and exhibited clearly
";
assert_eq_printed!(expected, got);
}
#[test]
fn per_match() {
let matcher = RegexMatcher::new("Doctor Watsons|Sherlock").unwrap();
@@ -2816,6 +3165,61 @@ Holmeses, success in the province of detective work must always
assert_eq_printed!(expected, got);
}
#[test]
fn replacement_max_columns_preview1() {
let matcher = RegexMatcher::new(r"Sherlock|Doctor (\w+)").unwrap();
let mut printer = StandardBuilder::new()
.max_columns(Some(67))
.max_columns_preview(true)
.replacement(Some(b"doctah $1 MD".to_vec()))
.build(NoColor::new(vec![]));
SearcherBuilder::new()
.line_number(true)
.build()
.search_reader(
&matcher,
SHERLOCK.as_bytes(),
printer.sink(&matcher),
)
.unwrap();
let got = printer_contents(&mut printer);
let expected = "\
1:For the doctah Watsons MD of this world, as opposed to the doctah [... 0 more matches]
3:be, to a very large extent, the result of luck. doctah MD Holmes
5:but doctah Watson MD has to have it taken out for him and dusted,
";
assert_eq_printed!(expected, got);
}
#[test]
fn replacement_max_columns_preview2() {
let matcher = RegexMatcher::new(
"exhibited|dusted|has to have it",
).unwrap();
let mut printer = StandardBuilder::new()
.max_columns(Some(43))
.max_columns_preview(true)
.replacement(Some(b"xxx".to_vec()))
.build(NoColor::new(vec![]));
SearcherBuilder::new()
.line_number(false)
.build()
.search_reader(
&matcher,
SHERLOCK.as_bytes(),
printer.sink(&matcher),
)
.unwrap();
let got = printer_contents(&mut printer);
let expected = "\
but Doctor Watson xxx taken out for him and [... 1 more match]
and xxx clearly, with a label attached.
";
assert_eq_printed!(expected, got);
}
#[test]
fn replacement_only_matching() {
let matcher = RegexMatcher::new(r"Sherlock|Doctor (\w+)").unwrap();

View File

@@ -403,7 +403,7 @@ impl<W: WriteColor> Summary<W> {
where M: Matcher,
P: ?Sized + AsRef<Path>,
{
if !self.config.path {
if !self.config.path && !self.config.kind.requires_path() {
return self.sink(matcher);
}
let stats =
@@ -477,7 +477,10 @@ impl<'p, 's, M: Matcher, W: WriteColor> SummarySink<'p, 's, M, W> {
/// This is unaffected by the result of searches before the previous
/// search.
pub fn has_match(&self) -> bool {
self.match_count > 0
match self.summary.config.kind {
SummaryKind::PathWithoutMatch => self.match_count == 0,
_ => self.match_count > 0,
}
}
/// If binary data was found in the previous search, this returns the
@@ -633,6 +636,34 @@ impl<'p, 's, M: Matcher, W: WriteColor> Sink for SummarySink<'p, 's, M, W> {
stats.add_bytes_searched(finish.byte_count());
stats.add_bytes_printed(self.summary.wtr.borrow().count());
}
// If our binary detection method says to quit after seeing binary
// data, then we shouldn't print any results at all, even if we've
// found a match before detecting binary data. The intent here is to
// keep BinaryDetection::quit as a form of filter. Otherwise, we can
// present a matching file with a smaller number of matches than
// there might be, which can be quite misleading.
//
// If our binary detection method is to convert binary data, then we
// don't quit and therefore search the entire contents of the file.
//
// There is an unfortunate inconsistency here. Namely, when using
// Quiet or PathWithMatch, then the printer can quit after the first
// match seen, which could be long before seeing binary data. This
// means that using PathWithMatch can print a path where as using
// Count might not print it at all because of binary data.
//
// It's not possible to fix this without also potentially significantly
// impacting the performance of Quiet or PathWithMatch, so we accept
// the bug.
if self.binary_byte_offset.is_some()
&& searcher.binary_detection().quit_byte().is_some()
{
// Squash the match count. The statistics reported will still
// contain the match count, but the "official" match count should
// be zero.
self.match_count = 0;
return Ok(());
}
let show_count =
!self.summary.config.exclude_zero

View File

@@ -4,6 +4,7 @@ use std::io;
use std::path::Path;
use std::time;
use bstr::{ByteSlice, ByteVec};
use grep_matcher::{Captures, LineTerminator, Match, Matcher};
use grep_searcher::{
LineIter,
@@ -267,21 +268,7 @@ pub struct PrinterPath<'a>(Cow<'a, [u8]>);
impl<'a> PrinterPath<'a> {
/// Create a new path suitable for printing.
pub fn new(path: &'a Path) -> PrinterPath<'a> {
PrinterPath::new_impl(path)
}
#[cfg(unix)]
fn new_impl(path: &'a Path) -> PrinterPath<'a> {
use std::os::unix::ffi::OsStrExt;
PrinterPath(Cow::Borrowed(path.as_os_str().as_bytes()))
}
#[cfg(not(unix))]
fn new_impl(path: &'a Path) -> PrinterPath<'a> {
PrinterPath(match path.to_string_lossy() {
Cow::Owned(path) => Cow::Owned(path.into_bytes()),
Cow::Borrowed(path) => Cow::Borrowed(path.as_bytes()),
})
PrinterPath(Vec::from_path_lossy(path))
}
/// Create a new printer path from the given path which can be efficiently
@@ -302,7 +289,7 @@ impl<'a> PrinterPath<'a> {
/// path separators that are both replaced by `new_sep`. In all other
/// environments, only `/` is treated as a path separator.
fn replace_separator(&mut self, new_sep: u8) {
let transformed_path: Vec<_> = self.as_bytes().iter().map(|&b| {
let transformed_path: Vec<u8> = self.0.bytes().map(|b| {
if b == b'/' || (cfg!(windows) && b == b'\\') {
new_sep
} else {
@@ -314,7 +301,7 @@ impl<'a> PrinterPath<'a> {
/// Return the raw bytes for this path.
pub fn as_bytes(&self) -> &[u8] {
&*self.0
&self.0
}
}
@@ -359,7 +346,7 @@ impl Serialize for NiceDuration {
///
/// This stops trimming a prefix as soon as it sees non-whitespace or a line
/// terminator.
pub fn trim_ascii_prefix_range(
pub fn trim_ascii_prefix(
line_term: LineTerminator,
slice: &[u8],
range: Match,
@@ -379,14 +366,3 @@ pub fn trim_ascii_prefix_range(
.count();
range.with_start(range.start() + count)
}
/// Trim prefix ASCII spaces from the given slice and return the corresponding
/// sub-slice.
pub fn trim_ascii_prefix(line_term: LineTerminator, slice: &[u8]) -> &[u8] {
let range = trim_ascii_prefix_range(
line_term,
slice,
Match::new(0, slice.len()),
);
&slice[range]
}

View File

@@ -1,6 +1,6 @@
[package]
name = "grep-regex"
version = "0.1.0" #:version
version = "0.1.3" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
Use Rust's regex library with the 'grep' crate.
@@ -13,9 +13,10 @@ keywords = ["regex", "grep", "search", "pattern", "line"]
license = "Unlicense/MIT"
[dependencies]
log = "0.4"
grep-matcher = { version = "0.1.0", path = "../grep-matcher" }
regex = "1"
regex-syntax = "0.6"
aho-corasick = "0.7.3"
grep-matcher = { version = "0.1.2", path = "../grep-matcher" }
log = "0.4.5"
regex = "1.1"
regex-syntax = "0.6.5"
thread_local = "0.3.6"
utf8-ranges = "1"
utf8-ranges = "1.0.1"

View File

@@ -1,12 +1,13 @@
use grep_matcher::{ByteSet, LineTerminator};
use regex::bytes::{Regex, RegexBuilder};
use regex_syntax::ast::{self, Ast};
use regex_syntax::hir::Hir;
use regex_syntax::hir::{self, Hir};
use ast::AstAnalysis;
use crlf::crlfify;
use error::Error;
use literal::LiteralSets;
use multi::alternation_literals;
use non_matching::non_matching_bytes;
use strip::strip_from_match;
@@ -67,19 +68,17 @@ impl Config {
/// If there was a problem parsing the given expression then an error
/// is returned.
pub fn hir(&self, pattern: &str) -> Result<ConfiguredHIR, Error> {
let analysis = self.analysis(pattern)?;
let expr = ::regex_syntax::ParserBuilder::new()
.nest_limit(self.nest_limit)
.octal(self.octal)
let ast = self.ast(pattern)?;
let analysis = self.analysis(&ast)?;
let expr = hir::translate::TranslatorBuilder::new()
.allow_invalid_utf8(true)
.ignore_whitespace(self.ignore_whitespace)
.case_insensitive(self.is_case_insensitive(&analysis)?)
.case_insensitive(self.is_case_insensitive(&analysis))
.multi_line(self.multi_line)
.dot_matches_new_line(self.dot_matches_new_line)
.swap_greed(self.swap_greed)
.unicode(self.unicode)
.build()
.parse(pattern)
.translate(pattern, &ast)
.map_err(Error::regex)?;
let expr = match self.line_terminator {
None => expr,
@@ -99,21 +98,34 @@ impl Config {
fn is_case_insensitive(
&self,
analysis: &AstAnalysis,
) -> Result<bool, Error> {
) -> bool {
if self.case_insensitive {
return Ok(true);
return true;
}
if !self.case_smart {
return Ok(false);
return false;
}
Ok(analysis.any_literal() && !analysis.any_uppercase())
analysis.any_literal() && !analysis.any_uppercase()
}
/// Returns true if and only if this config is simple enough such that
/// if the pattern is a simple alternation of literals, then it can be
/// constructed via a plain Aho-Corasick automaton.
///
/// Note that it is OK to return true even when settings like `multi_line`
/// are enabled, since if multi-line can impact the match semantics of a
/// regex, then it is by definition not a simple alternation of literals.
pub fn can_plain_aho_corasick(&self) -> bool {
!self.word
&& !self.case_insensitive
&& !self.case_smart
}
/// Perform analysis on the AST of this pattern.
///
/// This returns an error if the given pattern failed to parse.
fn analysis(&self, pattern: &str) -> Result<AstAnalysis, Error> {
Ok(AstAnalysis::from_ast(&self.ast(pattern)?))
fn analysis(&self, ast: &Ast) -> Result<AstAnalysis, Error> {
Ok(AstAnalysis::from_ast(ast))
}
/// Parse the given pattern into its abstract syntax.
@@ -160,11 +172,28 @@ impl ConfiguredHIR {
non_matching_bytes(&self.expr)
}
/// Returns true if and only if this regex needs to have its match offsets
/// tweaked because of CRLF support. Specifically, this occurs when the
/// CRLF hack is enabled and the regex is line anchored at the end. In
/// this case, matches that end with a `\r` have the `\r` stripped.
pub fn needs_crlf_stripped(&self) -> bool {
self.config.crlf && self.expr.is_line_anchored_end()
}
/// Builds a regular expression from this HIR expression.
pub fn regex(&self) -> Result<Regex, Error> {
self.pattern_to_regex(&self.expr.to_string())
}
/// If this HIR corresponds to an alternation of literals with no
/// capturing groups, then this returns those literals.
pub fn alternation_literals(&self) -> Option<Vec<Vec<u8>>> {
if !self.config.can_plain_aho_corasick() {
return None;
}
alternation_literals(&self.expr)
}
/// Applies the given function to the concrete syntax of this HIR and then
/// generates a new HIR based on the result of the function in a way that
/// preserves the configuration.
@@ -199,7 +228,7 @@ impl ConfiguredHIR {
if self.config.line_terminator.is_none() {
return Ok(None);
}
match LiteralSets::new(&self.expr).one_regex() {
match LiteralSets::new(&self.expr).one_regex(self.config.word) {
None => Ok(None),
Some(pattern) => self.pattern_to_regex(&pattern).map(Some),
}

View File

@@ -1,5 +1,112 @@
use std::collections::HashMap;
use grep_matcher::{Match, Matcher, NoError};
use regex::bytes::Regex;
use regex_syntax::hir::{self, Hir, HirKind};
use config::ConfiguredHIR;
use error::Error;
use matcher::RegexCaptures;
/// A matcher for implementing "word match" semantics.
#[derive(Clone, Debug)]
pub struct CRLFMatcher {
/// The regex.
regex: Regex,
/// A map from capture group name to capture group index.
names: HashMap<String, usize>,
}
impl CRLFMatcher {
/// Create a new matcher from the given pattern that strips `\r` from the
/// end of every match.
///
/// This panics if the given expression doesn't need its CRLF stripped.
pub fn new(expr: &ConfiguredHIR) -> Result<CRLFMatcher, Error> {
assert!(expr.needs_crlf_stripped());
let regex = expr.regex()?;
let mut names = HashMap::new();
for (i, optional_name) in regex.capture_names().enumerate() {
if let Some(name) = optional_name {
names.insert(name.to_string(), i.checked_sub(1).unwrap());
}
}
Ok(CRLFMatcher { regex, names })
}
/// Return the underlying regex used by this matcher.
pub fn regex(&self) -> &Regex {
&self.regex
}
}
impl Matcher for CRLFMatcher {
type Captures = RegexCaptures;
type Error = NoError;
fn find_at(
&self,
haystack: &[u8],
at: usize,
) -> Result<Option<Match>, NoError> {
let m = match self.regex.find_at(haystack, at) {
None => return Ok(None),
Some(m) => Match::new(m.start(), m.end()),
};
Ok(Some(adjust_match(haystack, m)))
}
fn new_captures(&self) -> Result<RegexCaptures, NoError> {
Ok(RegexCaptures::new(self.regex.capture_locations()))
}
fn capture_count(&self) -> usize {
self.regex.captures_len().checked_sub(1).unwrap()
}
fn capture_index(&self, name: &str) -> Option<usize> {
self.names.get(name).map(|i| *i)
}
fn captures_at(
&self,
haystack: &[u8],
at: usize,
caps: &mut RegexCaptures,
) -> Result<bool, NoError> {
caps.strip_crlf(false);
let r = self.regex.captures_read_at(
caps.locations_mut(), haystack, at,
);
if !r.is_some() {
return Ok(false);
}
// If the end of our match includes a `\r`, then strip it from all
// capture groups ending at the same location.
let end = caps.locations().get(0).unwrap().1;
if end > 0 && haystack.get(end - 1) == Some(&b'\r') {
caps.strip_crlf(true);
}
Ok(true)
}
// We specifically do not implement other methods like find_iter or
// captures_iter. Namely, the iter methods are guaranteed to be correct
// by virtue of implementing find_at and captures_at above.
}
/// If the given match ends with a `\r`, then return a new match that ends
/// immediately before the `\r`.
pub fn adjust_match(haystack: &[u8], m: Match) -> Match {
if m.end() > 0 && haystack.get(m.end() - 1) == Some(&b'\r') {
m.with_end(m.end() - 1)
} else {
m
}
}
/// Substitutes all occurrences of multi-line enabled `$` with `(?:\r?$)`.
///
/// This does not preserve the exact semantics of the given expression,

View File

@@ -4,6 +4,7 @@ An implementation of `grep-matcher`'s `Matcher` trait for Rust's regex engine.
#![deny(missing_docs)]
extern crate aho_corasick;
extern crate grep_matcher;
#[macro_use]
extern crate log;
@@ -21,6 +22,7 @@ mod crlf;
mod error;
mod literal;
mod matcher;
mod multi;
mod non_matching;
mod strip;
mod util;

View File

@@ -47,18 +47,23 @@ impl LiteralSets {
/// generated these literal sets. The idea here is that the pattern
/// returned by this method is much cheaper to search for. i.e., It is
/// usually a single literal or an alternation of literals.
pub fn one_regex(&self) -> Option<String> {
pub fn one_regex(&self, word: bool) -> Option<String> {
// TODO: The logic in this function is basically inscrutable. It grew
// organically in the old grep 0.1 crate. Ideally, it would be
// re-worked. In fact, the entire inner literal extraction should be
// re-worked. Actually, most of regex-syntax's literal extraction
// should also be re-worked. Alas... only so much time in the day.
if self.prefixes.all_complete() && !self.prefixes.is_empty() {
debug!("literal prefixes detected: {:?}", self.prefixes);
// When this is true, the regex engine will do a literal scan,
// so we don't need to return anything.
return None;
if !word {
if self.prefixes.all_complete() && !self.prefixes.is_empty() {
debug!("literal prefixes detected: {:?}", self.prefixes);
// When this is true, the regex engine will do a literal scan,
// so we don't need to return anything. But we only do this
// if we aren't doing a word regex, since a word regex adds
// a `(?:\W|^)` to the beginning of the regex, thereby
// defeating the regex engine's literal detection.
return None;
}
}
// Out of inner required literals, prefixes and suffixes, which one
@@ -166,10 +171,10 @@ fn union_required(expr: &Hir, lits: &mut Literals) {
lits.cut();
continue;
}
if lits2.contains_empty() {
if lits2.contains_empty() || !is_simple(&e) {
lits.cut();
}
if !lits.cross_product(&lits2) {
if !lits.cross_product(&lits2) || !lits2.any_complete() {
// If this expression couldn't yield any literal that
// could be extended, then we need to quit. Since we're
// short-circuiting, we also need to freeze every member.
@@ -250,6 +255,20 @@ fn alternate_literals<F: FnMut(&Hir, &mut Literals)>(
}
}
fn is_simple(expr: &Hir) -> bool {
match *expr.kind() {
HirKind::Empty
| HirKind::Literal(_)
| HirKind::Class(_)
| HirKind::Repetition(_)
| HirKind::Concat(_)
| HirKind::Alternation(_) => true,
HirKind::Anchor(_)
| HirKind::WordBoundary(_)
| HirKind::Group(_) => false,
}
}
/// Return the number of characters in the given class.
fn count_unicode_class(cls: &hir::ClassUnicode) -> u32 {
cls.iter().map(|r| 1 + (r.end() as u32 - r.start() as u32)).sum()
@@ -271,7 +290,7 @@ mod tests {
}
fn one_regex(pattern: &str) -> Option<String> {
sets(pattern).one_regex()
sets(pattern).one_regex(false)
}
// Put a pattern into the same format as the one returned by `one_regex`.
@@ -301,4 +320,12 @@ mod tests {
// assert_eq!(one_regex(r"\w(foo|bar|baz)"), pat("foo|bar|baz"));
// assert_eq!(one_regex(r"\w(foo|bar|baz)\w"), pat("foo|bar|baz"));
}
#[test]
fn regression_1064() {
// Regression from:
// https://github.com/BurntSushi/ripgrep/issues/1064
// assert_eq!(one_regex(r"a.*c"), pat("a"));
assert_eq!(one_regex(r"a(.*c)"), pat("a"));
}
}

View File

@@ -6,7 +6,9 @@ use grep_matcher::{
use regex::bytes::{CaptureLocations, Regex};
use config::{Config, ConfiguredHIR};
use crlf::CRLFMatcher;
use error::Error;
use multi::MultiLiteralMatcher;
use word::WordMatcher;
/// A builder for constructing a `Matcher` using regular expressions.
@@ -49,14 +51,40 @@ impl RegexMatcherBuilder {
if let Some(ref re) = fast_line_regex {
trace!("extracted fast line regex: {:?}", re);
}
let matcher = RegexMatcherImpl::new(&chir)?;
trace!("final regex: {:?}", matcher.regex());
Ok(RegexMatcher {
config: self.config.clone(),
matcher: RegexMatcherImpl::new(&chir)?,
matcher: matcher,
fast_line_regex: fast_line_regex,
non_matching_bytes: non_matching_bytes,
})
}
/// Build a new matcher from a plain alternation of literals.
///
/// Depending on the configuration set by the builder, this may be able to
/// build a matcher substantially faster than by joining the patterns with
/// a `|` and calling `build`.
pub fn build_literals<B: AsRef<str>>(
&self,
literals: &[B],
) -> Result<RegexMatcher, Error> {
let slices: Vec<_> = literals.iter().map(|s| s.as_ref()).collect();
if !self.config.can_plain_aho_corasick() || literals.len() < 40 {
return self.build(&slices.join("|"));
}
let matcher = MultiLiteralMatcher::new(&slices)?;
let imp = RegexMatcherImpl::MultiLiteral(matcher);
Ok(RegexMatcher {
config: self.config.clone(),
matcher: imp,
fast_line_regex: None,
non_matching_bytes: ByteSet::empty(),
})
}
/// Set the value for the case insensitive (`i`) flag.
///
/// When enabled, letters in the pattern will match both upper case and
@@ -323,8 +351,15 @@ impl RegexMatcher {
/// Create a new matcher from the given pattern using the default
/// configuration, but matches lines terminated by `\n`.
///
/// This returns an error if the given pattern contains a literal `\n`.
/// Other uses of `\n` (such as in `\s`) are removed transparently.
/// This is meant to be a convenience constructor for using a
/// `RegexMatcherBuilder` and setting its
/// [`line_terminator`](struct.RegexMatcherBuilder.html#method.line_terminator)
/// to `\n`. The purpose of using this constructor is to permit special
/// optimizations that help speed up line oriented search. These types of
/// optimizations are only appropriate when matches span no more than one
/// line. For this reason, this constructor will return an error if the
/// given pattern contains a literal `\n`. Other uses of `\n` (such as in
/// `\s`) are removed transparently.
pub fn new_line_matcher(pattern: &str) -> Result<RegexMatcher, Error> {
RegexMatcherBuilder::new()
.line_terminator(Some(b'\n'))
@@ -337,6 +372,13 @@ impl RegexMatcher {
enum RegexMatcherImpl {
/// The standard matcher used for all regular expressions.
Standard(StandardMatcher),
/// A matcher for an alternation of plain literals.
MultiLiteral(MultiLiteralMatcher),
/// A matcher that strips `\r` from the end of matches.
///
/// This is only used when the CRLF hack is enabled and the regex is line
/// anchored at the end.
CRLF(CRLFMatcher),
/// A matcher that only matches at word boundaries. This transforms the
/// regex to `(^|\W)(...)($|\W)` instead of the more intuitive `\b(...)\b`.
/// Because of this, the WordMatcher provides its own implementation of
@@ -351,10 +393,28 @@ impl RegexMatcherImpl {
fn new(expr: &ConfiguredHIR) -> Result<RegexMatcherImpl, Error> {
if expr.config().word {
Ok(RegexMatcherImpl::Word(WordMatcher::new(expr)?))
} else if expr.needs_crlf_stripped() {
Ok(RegexMatcherImpl::CRLF(CRLFMatcher::new(expr)?))
} else {
if let Some(lits) = expr.alternation_literals() {
if lits.len() >= 40 {
let matcher = MultiLiteralMatcher::new(&lits)?;
return Ok(RegexMatcherImpl::MultiLiteral(matcher));
}
}
Ok(RegexMatcherImpl::Standard(StandardMatcher::new(expr)?))
}
}
/// Return the underlying regex object used.
fn regex(&self) -> String {
match *self {
RegexMatcherImpl::Word(ref x) => x.regex().to_string(),
RegexMatcherImpl::CRLF(ref x) => x.regex().to_string(),
RegexMatcherImpl::MultiLiteral(_) => "<N/A>".to_string(),
RegexMatcherImpl::Standard(ref x) => x.regex.to_string(),
}
}
}
// This implementation just dispatches on the internal matcher impl except
@@ -372,6 +432,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.find_at(haystack, at),
MultiLiteral(ref m) => m.find_at(haystack, at),
CRLF(ref m) => m.find_at(haystack, at),
Word(ref m) => m.find_at(haystack, at),
}
}
@@ -380,6 +442,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.new_captures(),
MultiLiteral(ref m) => m.new_captures(),
CRLF(ref m) => m.new_captures(),
Word(ref m) => m.new_captures(),
}
}
@@ -388,6 +452,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.capture_count(),
MultiLiteral(ref m) => m.capture_count(),
CRLF(ref m) => m.capture_count(),
Word(ref m) => m.capture_count(),
}
}
@@ -396,6 +462,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.capture_index(name),
MultiLiteral(ref m) => m.capture_index(name),
CRLF(ref m) => m.capture_index(name),
Word(ref m) => m.capture_index(name),
}
}
@@ -404,6 +472,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.find(haystack),
MultiLiteral(ref m) => m.find(haystack),
CRLF(ref m) => m.find(haystack),
Word(ref m) => m.find(haystack),
}
}
@@ -418,6 +488,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.find_iter(haystack, matched),
MultiLiteral(ref m) => m.find_iter(haystack, matched),
CRLF(ref m) => m.find_iter(haystack, matched),
Word(ref m) => m.find_iter(haystack, matched),
}
}
@@ -432,6 +504,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.try_find_iter(haystack, matched),
MultiLiteral(ref m) => m.try_find_iter(haystack, matched),
CRLF(ref m) => m.try_find_iter(haystack, matched),
Word(ref m) => m.try_find_iter(haystack, matched),
}
}
@@ -444,6 +518,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.captures(haystack, caps),
MultiLiteral(ref m) => m.captures(haystack, caps),
CRLF(ref m) => m.captures(haystack, caps),
Word(ref m) => m.captures(haystack, caps),
}
}
@@ -459,6 +535,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.captures_iter(haystack, caps, matched),
MultiLiteral(ref m) => m.captures_iter(haystack, caps, matched),
CRLF(ref m) => m.captures_iter(haystack, caps, matched),
Word(ref m) => m.captures_iter(haystack, caps, matched),
}
}
@@ -474,6 +552,10 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.try_captures_iter(haystack, caps, matched),
MultiLiteral(ref m) => {
m.try_captures_iter(haystack, caps, matched)
}
CRLF(ref m) => m.try_captures_iter(haystack, caps, matched),
Word(ref m) => m.try_captures_iter(haystack, caps, matched),
}
}
@@ -487,6 +569,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.captures_at(haystack, at, caps),
MultiLiteral(ref m) => m.captures_at(haystack, at, caps),
CRLF(ref m) => m.captures_at(haystack, at, caps),
Word(ref m) => m.captures_at(haystack, at, caps),
}
}
@@ -502,6 +586,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.replace(haystack, dst, append),
MultiLiteral(ref m) => m.replace(haystack, dst, append),
CRLF(ref m) => m.replace(haystack, dst, append),
Word(ref m) => m.replace(haystack, dst, append),
}
}
@@ -520,6 +606,12 @@ impl Matcher for RegexMatcher {
Standard(ref m) => {
m.replace_with_captures(haystack, caps, dst, append)
}
MultiLiteral(ref m) => {
m.replace_with_captures(haystack, caps, dst, append)
}
CRLF(ref m) => {
m.replace_with_captures(haystack, caps, dst, append)
}
Word(ref m) => {
m.replace_with_captures(haystack, caps, dst, append)
}
@@ -530,6 +622,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.is_match(haystack),
MultiLiteral(ref m) => m.is_match(haystack),
CRLF(ref m) => m.is_match(haystack),
Word(ref m) => m.is_match(haystack),
}
}
@@ -542,6 +636,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.is_match_at(haystack, at),
MultiLiteral(ref m) => m.is_match_at(haystack, at),
CRLF(ref m) => m.is_match_at(haystack, at),
Word(ref m) => m.is_match_at(haystack, at),
}
}
@@ -553,6 +649,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.shortest_match(haystack),
MultiLiteral(ref m) => m.shortest_match(haystack),
CRLF(ref m) => m.shortest_match(haystack),
Word(ref m) => m.shortest_match(haystack),
}
}
@@ -565,6 +663,8 @@ impl Matcher for RegexMatcher {
use self::RegexMatcherImpl::*;
match self.matcher {
Standard(ref m) => m.shortest_match_at(haystack, at),
MultiLiteral(ref m) => m.shortest_match_at(haystack, at),
CRLF(ref m) => m.shortest_match_at(haystack, at),
Word(ref m) => m.shortest_match_at(haystack, at),
}
}
@@ -664,7 +764,9 @@ impl Matcher for StandardMatcher {
at: usize,
caps: &mut RegexCaptures,
) -> Result<bool, NoError> {
Ok(self.regex.captures_read_at(&mut caps.locs, haystack, at).is_some())
Ok(self.regex.captures_read_at(
&mut caps.locations_mut(), haystack, at,
).is_some())
}
fn shortest_match_at(
@@ -691,34 +793,84 @@ impl Matcher for StandardMatcher {
/// index of the group using the corresponding matcher's `capture_index`
/// method, and then use that index with `RegexCaptures::get`.
#[derive(Clone, Debug)]
pub struct RegexCaptures {
/// Where the locations are stored.
locs: CaptureLocations,
/// These captures behave as if the capturing groups begin at the given
/// offset. When set to `0`, this has no affect and capture groups are
/// indexed like normal.
///
/// This is useful when building matchers that wrap arbitrary regular
/// expressions. For example, `WordMatcher` takes an existing regex `re`
/// and creates `(?:^|\W)(re)(?:$|\W)`, but hides the fact that the regex
/// has been wrapped from the caller. In order to do this, the matcher
/// and the capturing groups must behave as if `(re)` is the `0`th capture
/// group.
offset: usize,
pub struct RegexCaptures(RegexCapturesImp);
#[derive(Clone, Debug)]
enum RegexCapturesImp {
AhoCorasick {
/// The start and end of the match, corresponding to capture group 0.
mat: Option<Match>,
},
Regex {
/// Where the locations are stored.
locs: CaptureLocations,
/// These captures behave as if the capturing groups begin at the given
/// offset. When set to `0`, this has no affect and capture groups are
/// indexed like normal.
///
/// This is useful when building matchers that wrap arbitrary regular
/// expressions. For example, `WordMatcher` takes an existing regex
/// `re` and creates `(?:^|\W)(re)(?:$|\W)`, but hides the fact that
/// the regex has been wrapped from the caller. In order to do this,
/// the matcher and the capturing groups must behave as if `(re)` is
/// the `0`th capture group.
offset: usize,
/// When enable, the end of a match has `\r` stripped from it, if one
/// exists.
strip_crlf: bool,
},
}
impl Captures for RegexCaptures {
fn len(&self) -> usize {
self.locs.len().checked_sub(self.offset).unwrap()
match self.0 {
RegexCapturesImp::AhoCorasick { .. } => 1,
RegexCapturesImp::Regex { ref locs, offset, .. } => {
locs.len().checked_sub(offset).unwrap()
}
}
}
fn get(&self, i: usize) -> Option<Match> {
let actual = i.checked_add(self.offset).unwrap();
self.locs.pos(actual).map(|(s, e)| Match::new(s, e))
match self.0 {
RegexCapturesImp::AhoCorasick { mat, .. } => {
if i == 0 {
mat
} else {
None
}
}
RegexCapturesImp::Regex { ref locs, offset, strip_crlf } => {
if !strip_crlf {
let actual = i.checked_add(offset).unwrap();
return locs.pos(actual).map(|(s, e)| Match::new(s, e));
}
// currently don't support capture offsetting with CRLF
// stripping
assert_eq!(offset, 0);
let m = match locs.pos(i).map(|(s, e)| Match::new(s, e)) {
None => return None,
Some(m) => m,
};
// If the end position of this match corresponds to the end
// position of the overall match, then we apply our CRLF
// stripping. Otherwise, we cannot assume stripping is correct.
if i == 0 || m.end() == locs.pos(0).unwrap().1 {
Some(m.with_end(m.end() - 1))
} else {
Some(m)
}
}
}
}
}
impl RegexCaptures {
pub(crate) fn simple() -> RegexCaptures {
RegexCaptures(RegexCapturesImp::AhoCorasick { mat: None })
}
pub(crate) fn new(locs: CaptureLocations) -> RegexCaptures {
RegexCaptures::with_offset(locs, 0)
}
@@ -727,11 +879,53 @@ impl RegexCaptures {
locs: CaptureLocations,
offset: usize,
) -> RegexCaptures {
RegexCaptures { locs, offset }
RegexCaptures(RegexCapturesImp::Regex {
locs, offset, strip_crlf: false,
})
}
pub(crate) fn locations(&mut self) -> &mut CaptureLocations {
&mut self.locs
pub(crate) fn locations(&self) -> &CaptureLocations {
match self.0 {
RegexCapturesImp::AhoCorasick { .. } => {
panic!("getting locations for simple captures is invalid")
}
RegexCapturesImp::Regex { ref locs, .. } => {
locs
}
}
}
pub(crate) fn locations_mut(&mut self) -> &mut CaptureLocations {
match self.0 {
RegexCapturesImp::AhoCorasick { .. } => {
panic!("getting locations for simple captures is invalid")
}
RegexCapturesImp::Regex { ref mut locs, .. } => {
locs
}
}
}
pub(crate) fn strip_crlf(&mut self, yes: bool) {
match self.0 {
RegexCapturesImp::AhoCorasick { .. } => {
panic!("setting strip_crlf for simple captures is invalid")
}
RegexCapturesImp::Regex { ref mut strip_crlf, .. } => {
*strip_crlf = yes;
}
}
}
pub(crate) fn set_simple(&mut self, one: Option<Match>) {
match self.0 {
RegexCapturesImp::AhoCorasick { ref mut mat } => {
*mat = one;
}
RegexCapturesImp::Regex { .. } => {
panic!("setting simple captures for regex is invalid")
}
}
}
}

127
grep-regex/src/multi.rs Normal file
View File

@@ -0,0 +1,127 @@
use aho_corasick::{AhoCorasick, AhoCorasickBuilder, MatchKind};
use grep_matcher::{Matcher, Match, NoError};
use regex_syntax::hir::Hir;
use error::Error;
use matcher::RegexCaptures;
/// A matcher for an alternation of literals.
///
/// Ideally, this optimization would be pushed down into the regex engine, but
/// making this work correctly there would require quite a bit of refactoring.
/// Moreover, doing it one layer above lets us do thing like, "if we
/// specifically only want to search for literals, then don't bother with
/// regex parsing at all."
#[derive(Clone, Debug)]
pub struct MultiLiteralMatcher {
/// The Aho-Corasick automaton.
ac: AhoCorasick,
}
impl MultiLiteralMatcher {
/// Create a new multi-literal matcher from the given literals.
pub fn new<B: AsRef<[u8]>>(
literals: &[B],
) -> Result<MultiLiteralMatcher, Error> {
let ac = AhoCorasickBuilder::new()
.match_kind(MatchKind::LeftmostFirst)
.auto_configure(literals)
.build_with_size::<usize, _, _>(literals)
.map_err(Error::regex)?;
Ok(MultiLiteralMatcher { ac })
}
}
impl Matcher for MultiLiteralMatcher {
type Captures = RegexCaptures;
type Error = NoError;
fn find_at(
&self,
haystack: &[u8],
at: usize,
) -> Result<Option<Match>, NoError> {
match self.ac.find(&haystack[at..]) {
None => Ok(None),
Some(m) => Ok(Some(Match::new(at + m.start(), at + m.end()))),
}
}
fn new_captures(&self) -> Result<RegexCaptures, NoError> {
Ok(RegexCaptures::simple())
}
fn capture_count(&self) -> usize {
1
}
fn capture_index(&self, _: &str) -> Option<usize> {
None
}
fn captures_at(
&self,
haystack: &[u8],
at: usize,
caps: &mut RegexCaptures,
) -> Result<bool, NoError> {
caps.set_simple(None);
let mat = self.find_at(haystack, at)?;
caps.set_simple(mat);
Ok(mat.is_some())
}
// We specifically do not implement other methods like find_iter. Namely,
// the iter methods are guaranteed to be correct by virtue of implementing
// find_at above.
}
/// Alternation literals checks if the given HIR is a simple alternation of
/// literals, and if so, returns them. Otherwise, this returns None.
pub fn alternation_literals(expr: &Hir) -> Option<Vec<Vec<u8>>> {
use regex_syntax::hir::{HirKind, Literal};
// This is pretty hacky, but basically, if `is_alternation_literal` is
// true, then we can make several assumptions about the structure of our
// HIR. This is what justifies the `unreachable!` statements below.
if !expr.is_alternation_literal() {
return None;
}
let alts = match *expr.kind() {
HirKind::Alternation(ref alts) => alts,
_ => return None, // one literal isn't worth it
};
let extendlit = |lit: &Literal, dst: &mut Vec<u8>| {
match *lit {
Literal::Unicode(c) => {
let mut buf = [0; 4];
dst.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
}
Literal::Byte(b) => {
dst.push(b);
}
}
};
let mut lits = vec![];
for alt in alts {
let mut lit = vec![];
match *alt.kind() {
HirKind::Empty => {}
HirKind::Literal(ref x) => extendlit(x, &mut lit),
HirKind::Concat(ref exprs) => {
for e in exprs {
match *e.kind() {
HirKind::Literal(ref x) => extendlit(x, &mut lit),
_ => unreachable!("expected literal, got {:?}", e),
}
}
}
_ => unreachable!("expected literal or concat, got {:?}", alt),
}
lits.push(lit);
}
Some(lits)
}

View File

@@ -55,6 +55,11 @@ impl WordMatcher {
}
Ok(WordMatcher { regex, names, locs })
}
/// Return the underlying regex used by this matcher.
pub fn regex(&self) -> &Regex {
&self.regex
}
}
impl Matcher for WordMatcher {
@@ -98,7 +103,9 @@ impl Matcher for WordMatcher {
at: usize,
caps: &mut RegexCaptures,
) -> Result<bool, NoError> {
let r = self.regex.captures_read_at(caps.locations(), haystack, at);
let r = self.regex.captures_read_at(
caps.locations_mut(), haystack, at,
);
Ok(r.is_some())
}

View File

@@ -1,6 +1,6 @@
[package]
name = "grep-searcher"
version = "0.1.0" #:version
version = "0.1.5" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
Fast line oriented regex searching as a library.
@@ -13,23 +13,21 @@ keywords = ["regex", "grep", "egrep", "search", "pattern"]
license = "Unlicense/MIT"
[dependencies]
bytecount = "0.3.1"
encoding_rs = "0.8"
encoding_rs_io = "0.1.2"
grep-matcher = { version = "0.1.0", path = "../grep-matcher" }
log = "0.4"
memchr = "2"
memmap = "0.6"
bstr = { version = "0.2.0", default-features = false, features = ["std"] }
bytecount = "0.5"
encoding_rs = "0.8.14"
encoding_rs_io = "0.1.6"
grep-matcher = { version = "0.1.2", path = "../grep-matcher" }
log = "0.4.5"
memmap = "0.7"
[dev-dependencies]
grep-regex = { version = "0.1.0", path = "../grep-regex" }
regex = "1"
grep-regex = { version = "0.1.3", path = "../grep-regex" }
regex = "1.1"
[features]
avx-accel = [
"bytecount/avx-accel",
]
simd-accel = [
"bytecount/simd-accel",
"encoding_rs/simd-accel",
]
default = ["bytecount/runtime-dispatch-simd"]
simd-accel = ["encoding_rs/simd-accel"]
# This feature is DEPRECATED. Runtime dispatch is used for SIMD now.
avx-accel = []

View File

@@ -17,7 +17,7 @@ fn main() {
}
}
fn example() -> Result<(), Box<Error>> {
fn example() -> Result<(), Box<dyn Error>> {
let pattern = match env::args().nth(1) {
Some(pattern) => pattern,
None => return Err(From::from(format!(

View File

@@ -74,14 +74,11 @@ fn example() -> Result<(), Box<Error>> {
let mut matches: Vec<(u64, String)> = vec![];
Searcher::new().search_slice(&matcher, SHERLOCK, UTF8(|lnum, line| {
// We are guaranteed to find a match, so the unwrap is OK.
eprintln!("LINE: {:?}", line);
let mymatch = matcher.find(line.as_bytes())?.unwrap();
matches.push((lnum, line[mymatch].to_string()));
Ok(true)
}))?;
eprintln!("MATCHES: {:?}", matches);
assert_eq!(matches.len(), 2);
assert_eq!(
matches[0],
@@ -102,13 +99,13 @@ searches stdin.
#![deny(missing_docs)]
extern crate bstr;
extern crate bytecount;
extern crate encoding_rs;
extern crate encoding_rs_io;
extern crate grep_matcher;
#[macro_use]
extern crate log;
extern crate memchr;
extern crate memmap;
#[cfg(test)]
extern crate regex;

View File

@@ -1,8 +1,7 @@
use std::cmp;
use std::io;
use std::ptr;
use memchr::{memchr, memrchr};
use bstr::ByteSlice;
/// The default buffer capacity that we use for the line buffer.
pub(crate) const DEFAULT_BUFFER_CAPACITY: usize = 8 * (1<<10); // 8 KB
@@ -258,6 +257,13 @@ impl<'b, R: io::Read> LineBufferReader<'b, R> {
self.line_buffer.buffer()
}
/// Return the buffer as a BStr, used for convenient equality checking
/// in tests only.
#[cfg(test)]
fn bstr(&self) -> &::bstr::BStr {
self.buffer().as_bstr()
}
/// Consume the number of bytes provided. This must be less than or equal
/// to the number of bytes returned by `buffer`.
pub fn consume(&mut self, amt: usize) {
@@ -312,6 +318,14 @@ pub struct LineBuffer {
}
impl LineBuffer {
/// Set the binary detection method used on this line buffer.
///
/// This permits dynamically changing the binary detection strategy on
/// an existing line buffer without needing to create a new one.
pub fn set_binary_detection(&mut self, binary: BinaryDetection) {
self.config.binary = binary;
}
/// Reset this buffer, such that it can be used with a new reader.
fn clear(&mut self) {
self.pos = 0;
@@ -396,7 +410,7 @@ impl LineBuffer {
assert_eq!(self.pos, 0);
loop {
self.ensure_capacity()?;
let readlen = rdr.read(self.free_buffer())?;
let readlen = rdr.read(self.free_buffer().as_bytes_mut())?;
if readlen == 0 {
// We're only done reading for good once the caller has
// consumed everything.
@@ -416,7 +430,7 @@ impl LineBuffer {
match self.config.binary {
BinaryDetection::None => {} // nothing to do
BinaryDetection::Quit(byte) => {
if let Some(i) = memchr(byte, newbytes) {
if let Some(i) = newbytes.find_byte(byte) {
self.end = oldend + i;
self.last_lineterm = self.end;
self.binary_byte_offset =
@@ -444,7 +458,7 @@ impl LineBuffer {
}
// Update our `last_lineterm` positions if we read one.
if let Some(i) = memrchr(self.config.lineterm, newbytes) {
if let Some(i) = newbytes.rfind_byte(self.config.lineterm) {
self.last_lineterm = oldend + i + 1;
return Ok(true);
}
@@ -467,40 +481,8 @@ impl LineBuffer {
return;
}
assert!(self.pos < self.end && self.end <= self.buf.len());
let roll_len = self.end - self.pos;
unsafe {
// SAFETY: A buffer contains Copy data, so there's no problem
// moving it around. Safety also depends on our indices being
// in bounds, which they should always be, and we enforce with
// an assert above.
//
// It seems like it should be possible to do this in safe code that
// results in the same codegen. I tried the obvious:
//
// for (src, dst) in (self.pos..self.end).zip(0..) {
// self.buf[dst] = self.buf[src];
// }
//
// But the above does not work, and in fact compiles down to a slow
// byte-by-byte loop. I tried a few other minor variations, but
// alas, better minds might prevail.
//
// Overall, this doesn't save us *too* much. It mostly matters when
// the number of bytes we're copying is large, which can happen
// if the searcher is asked to produce a lot of context. We could
// decide this isn't worth it, but it does make an appreciable
// impact at or around the context=30 range on my machine.
//
// We could also use a temporary buffer that compiles down to two
// memcpys and is faster than the byte-at-a-time loop, but it
// complicates our options for limiting memory allocation a bit.
ptr::copy(
self.buf[self.pos..].as_ptr(),
self.buf.as_mut_ptr(),
roll_len,
);
}
self.buf.copy_within_str(self.pos.., 0);
self.pos = 0;
self.last_lineterm = roll_len;
self.end = roll_len;
@@ -536,14 +518,15 @@ impl LineBuffer {
}
}
/// Replaces `src` with `replacement` in bytes.
/// Replaces `src` with `replacement` in bytes, and return the offset of the
/// first replacement, if one exists.
fn replace_bytes(bytes: &mut [u8], src: u8, replacement: u8) -> Option<usize> {
if src == replacement {
return None;
}
let mut first_pos = None;
let mut pos = 0;
while let Some(i) = memchr(src, &bytes[pos..]).map(|i| pos + i) {
while let Some(i) = bytes[pos..].find_byte(src).map(|i| pos + i) {
if first_pos.is_none() {
first_pos = Some(i);
}
@@ -560,6 +543,7 @@ fn replace_bytes(bytes: &mut [u8], src: u8, replacement: u8) -> Option<usize> {
#[cfg(test)]
mod tests {
use std::str;
use bstr::{ByteSlice, ByteVec};
use super::*;
const SHERLOCK: &'static str = "\
@@ -575,18 +559,14 @@ and exhibited clearly, with a label attached.\
slice.to_string()
}
fn btos(slice: &[u8]) -> &str {
str::from_utf8(slice).unwrap()
}
fn replace_str(
slice: &str,
src: u8,
replacement: u8,
) -> (String, Option<usize>) {
let mut dst = slice.to_string().into_bytes();
let mut dst = Vec::from(slice);
let result = replace_bytes(&mut dst, src, replacement);
(String::from_utf8(dst).unwrap(), result)
(dst.into_string().unwrap(), result)
}
#[test]
@@ -607,7 +587,7 @@ and exhibited clearly, with a label attached.\
assert!(rdr.buffer().is_empty());
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "homer\nlisa\n");
assert_eq!(rdr.bstr(), "homer\nlisa\n");
assert_eq!(rdr.absolute_byte_offset(), 0);
rdr.consume(5);
assert_eq!(rdr.absolute_byte_offset(), 5);
@@ -615,7 +595,7 @@ and exhibited clearly, with a label attached.\
assert_eq!(rdr.absolute_byte_offset(), 11);
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "maggie");
assert_eq!(rdr.bstr(), "maggie");
rdr.consume_all();
assert!(!rdr.fill().unwrap());
@@ -630,7 +610,7 @@ and exhibited clearly, with a label attached.\
let mut rdr = LineBufferReader::new(bytes.as_bytes(), &mut linebuf);
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "homer\nlisa\nmaggie\n");
assert_eq!(rdr.bstr(), "homer\nlisa\nmaggie\n");
rdr.consume_all();
assert!(!rdr.fill().unwrap());
@@ -645,7 +625,7 @@ and exhibited clearly, with a label attached.\
let mut rdr = LineBufferReader::new(bytes.as_bytes(), &mut linebuf);
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "\n");
assert_eq!(rdr.bstr(), "\n");
rdr.consume_all();
assert!(!rdr.fill().unwrap());
@@ -660,7 +640,7 @@ and exhibited clearly, with a label attached.\
let mut rdr = LineBufferReader::new(bytes.as_bytes(), &mut linebuf);
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "\n\n");
assert_eq!(rdr.bstr(), "\n\n");
rdr.consume_all();
assert!(!rdr.fill().unwrap());
@@ -700,10 +680,10 @@ and exhibited clearly, with a label attached.\
let mut got = vec![];
while rdr.fill().unwrap() {
got.extend(rdr.buffer());
got.push_str(rdr.buffer());
rdr.consume_all();
}
assert_eq!(bytes, btos(&got));
assert_eq!(bytes, got.as_bstr());
assert_eq!(rdr.absolute_byte_offset(), bytes.len() as u64);
assert_eq!(rdr.binary_byte_offset(), None);
}
@@ -718,11 +698,11 @@ and exhibited clearly, with a label attached.\
let mut rdr = LineBufferReader::new(bytes.as_bytes(), &mut linebuf);
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "homer\n");
assert_eq!(rdr.bstr(), "homer\n");
rdr.consume_all();
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "lisa\n");
assert_eq!(rdr.bstr(), "lisa\n");
rdr.consume_all();
// This returns an error because while we have just enough room to
@@ -732,11 +712,11 @@ and exhibited clearly, with a label attached.\
assert!(rdr.fill().is_err());
// We can mush on though!
assert_eq!(btos(rdr.buffer()), "m");
assert_eq!(rdr.bstr(), "m");
rdr.consume_all();
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "aggie");
assert_eq!(rdr.bstr(), "aggie");
rdr.consume_all();
assert!(!rdr.fill().unwrap());
@@ -752,16 +732,16 @@ and exhibited clearly, with a label attached.\
let mut rdr = LineBufferReader::new(bytes.as_bytes(), &mut linebuf);
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "homer\n");
assert_eq!(rdr.bstr(), "homer\n");
rdr.consume_all();
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "lisa\n");
assert_eq!(rdr.bstr(), "lisa\n");
rdr.consume_all();
// We have just enough space.
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "maggie");
assert_eq!(rdr.bstr(), "maggie");
rdr.consume_all();
assert!(!rdr.fill().unwrap());
@@ -777,7 +757,7 @@ and exhibited clearly, with a label attached.\
let mut rdr = LineBufferReader::new(bytes.as_bytes(), &mut linebuf);
assert!(rdr.fill().is_err());
assert_eq!(btos(rdr.buffer()), "");
assert_eq!(rdr.bstr(), "");
}
#[test]
@@ -789,7 +769,7 @@ and exhibited clearly, with a label attached.\
assert!(rdr.buffer().is_empty());
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "homer\nli\x00sa\nmaggie\n");
assert_eq!(rdr.bstr(), "homer\nli\x00sa\nmaggie\n");
rdr.consume_all();
assert!(!rdr.fill().unwrap());
@@ -808,7 +788,7 @@ and exhibited clearly, with a label attached.\
assert!(rdr.buffer().is_empty());
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "homer\nli");
assert_eq!(rdr.bstr(), "homer\nli");
rdr.consume_all();
assert!(!rdr.fill().unwrap());
@@ -825,7 +805,7 @@ and exhibited clearly, with a label attached.\
let mut rdr = LineBufferReader::new(bytes.as_bytes(), &mut linebuf);
assert!(!rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "");
assert_eq!(rdr.bstr(), "");
assert_eq!(rdr.absolute_byte_offset(), 0);
assert_eq!(rdr.binary_byte_offset(), Some(0));
}
@@ -841,7 +821,7 @@ and exhibited clearly, with a label attached.\
assert!(rdr.buffer().is_empty());
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "homer\nlisa\nmaggie\n");
assert_eq!(rdr.bstr(), "homer\nlisa\nmaggie\n");
rdr.consume_all();
assert!(!rdr.fill().unwrap());
@@ -860,7 +840,7 @@ and exhibited clearly, with a label attached.\
assert!(rdr.buffer().is_empty());
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "homer\nlisa\nmaggie");
assert_eq!(rdr.bstr(), "homer\nlisa\nmaggie");
rdr.consume_all();
assert!(!rdr.fill().unwrap());
@@ -878,7 +858,7 @@ and exhibited clearly, with a label attached.\
assert!(rdr.buffer().is_empty());
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "\
assert_eq!(rdr.bstr(), "\
For the Doctor Watsons of this world, as opposed to the Sherlock
Holmeses, s\
");
@@ -901,7 +881,7 @@ Holmeses, s\
assert!(rdr.buffer().is_empty());
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "homer\nli\nsa\nmaggie\n");
assert_eq!(rdr.bstr(), "homer\nli\nsa\nmaggie\n");
rdr.consume_all();
assert!(!rdr.fill().unwrap());
@@ -920,7 +900,7 @@ Holmeses, s\
assert!(rdr.buffer().is_empty());
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "\nhomer\nlisa\nmaggie\n");
assert_eq!(rdr.bstr(), "\nhomer\nlisa\nmaggie\n");
rdr.consume_all();
assert!(!rdr.fill().unwrap());
@@ -939,7 +919,7 @@ Holmeses, s\
assert!(rdr.buffer().is_empty());
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "homer\nlisa\nmaggie\n\n");
assert_eq!(rdr.bstr(), "homer\nlisa\nmaggie\n\n");
rdr.consume_all();
assert!(!rdr.fill().unwrap());
@@ -958,7 +938,7 @@ Holmeses, s\
assert!(rdr.buffer().is_empty());
assert!(rdr.fill().unwrap());
assert_eq!(btos(rdr.buffer()), "homer\nlisa\nmaggie\n\n");
assert_eq!(rdr.bstr(), "homer\nlisa\nmaggie\n\n");
rdr.consume_all();
assert!(!rdr.fill().unwrap());

View File

@@ -2,8 +2,8 @@
A collection of routines for performing operations on lines.
*/
use bstr::ByteSlice;
use bytecount;
use memchr::{memchr, memrchr};
use grep_matcher::{LineTerminator, Match};
/// An iterator over lines in a particular slice of bytes.
@@ -85,7 +85,7 @@ impl LineStep {
#[inline(always)]
fn next_impl(&mut self, mut bytes: &[u8]) -> Option<(usize, usize)> {
bytes = &bytes[..self.end];
match memchr(self.line_term, &bytes[self.pos..]) {
match bytes[self.pos..].find_byte(self.line_term) {
None => {
if self.pos < bytes.len() {
let m = (self.pos, bytes.len());
@@ -135,14 +135,16 @@ pub fn locate(
line_term: u8,
range: Match,
) -> Match {
let line_start = memrchr(line_term, &bytes[0..range.start()])
let line_start = bytes[..range.start()]
.rfind_byte(line_term)
.map_or(0, |i| i + 1);
let line_end =
if range.end() > line_start && bytes[range.end() - 1] == line_term {
range.end()
} else {
memchr(line_term, &bytes[range.end()..])
.map_or(bytes.len(), |i| range.end() + i + 1)
bytes[range.end()..]
.find_byte(line_term)
.map_or(bytes.len(), |i| range.end() + i + 1)
};
Match::new(line_start, line_end)
}
@@ -180,7 +182,7 @@ fn preceding_by_pos(
pos -= 1;
}
loop {
match memrchr(line_term, &bytes[..pos]) {
match bytes[..pos].rfind_byte(line_term) {
None => {
return 0;
}

View File

@@ -1,3 +1,4 @@
/// Like assert_eq, but nicer output for long strings.
#[cfg(test)]
#[macro_export]
macro_rules! assert_eq_printed {

View File

@@ -1,6 +1,6 @@
use std::cmp;
use memchr::memchr;
use bstr::ByteSlice;
use grep_matcher::{LineMatchKind, Matcher};
use lines::{self, LineStep};
@@ -90,6 +90,13 @@ impl<'s, M: Matcher, S: Sink> Core<'s, M, S> {
self.sink_matched(buf, range)
}
pub fn binary_data(
&mut self,
binary_byte_offset: u64,
) -> Result<bool, S::Error> {
self.sink.binary_data(&self.searcher, binary_byte_offset)
}
pub fn begin(&mut self) -> Result<bool, S::Error> {
self.sink.begin(&self.searcher)
}
@@ -141,19 +148,28 @@ impl<'s, M: Matcher, S: Sink> Core<'s, M, S> {
consumed
}
pub fn detect_binary(&mut self, buf: &[u8], range: &Range) -> bool {
pub fn detect_binary(
&mut self,
buf: &[u8],
range: &Range,
) -> Result<bool, S::Error> {
if self.binary_byte_offset.is_some() {
return true;
return Ok(self.config.binary.quit_byte().is_some());
}
let binary_byte = match self.config.binary.0 {
BinaryDetection::Quit(b) => b,
_ => return false,
BinaryDetection::Convert(b) => b,
_ => return Ok(false),
};
if let Some(i) = memchr(binary_byte, &buf[*range]) {
self.binary_byte_offset = Some(range.start() + i);
true
if let Some(i) = buf[*range].find_byte(binary_byte) {
let offset = range.start() + i;
self.binary_byte_offset = Some(offset);
if !self.binary_data(offset as u64)? {
return Ok(true);
}
Ok(self.config.binary.quit_byte().is_some())
} else {
false
Ok(false)
}
}
@@ -416,7 +432,7 @@ impl<'s, M: Matcher, S: Sink> Core<'s, M, S> {
buf: &[u8],
range: &Range,
) -> Result<bool, S::Error> {
if self.binary && self.detect_binary(buf, range) {
if self.binary && self.detect_binary(buf, range)? {
return Ok(false);
}
if !self.sink_break_context(range.start())? {
@@ -424,16 +440,7 @@ impl<'s, M: Matcher, S: Sink> Core<'s, M, S> {
}
self.count_lines(buf, range.start());
let offset = self.absolute_byte_offset + range.start() as u64;
let linebuf =
if self.config.line_term.is_crlf() {
// Normally, a line terminator is never part of a match, but
// if the line terminator is CRLF, then it's possible for `\r`
// to end up in the match, which we generally don't want. So
// we strip it here.
lines::without_terminator(&buf[*range], self.config.line_term)
} else {
&buf[*range]
};
let linebuf = &buf[*range];
let keepgoing = self.sink.matched(
&self.searcher,
&SinkMatch {
@@ -457,7 +464,7 @@ impl<'s, M: Matcher, S: Sink> Core<'s, M, S> {
buf: &[u8],
range: &Range,
) -> Result<bool, S::Error> {
if self.binary && self.detect_binary(buf, range) {
if self.binary && self.detect_binary(buf, range)? {
return Ok(false);
}
self.count_lines(buf, range.start());
@@ -487,7 +494,7 @@ impl<'s, M: Matcher, S: Sink> Core<'s, M, S> {
) -> Result<bool, S::Error> {
assert!(self.after_context_left >= 1);
if self.binary && self.detect_binary(buf, range) {
if self.binary && self.detect_binary(buf, range)? {
return Ok(false);
}
self.count_lines(buf, range.start());
@@ -516,7 +523,7 @@ impl<'s, M: Matcher, S: Sink> Core<'s, M, S> {
buf: &[u8],
range: &Range,
) -> Result<bool, S::Error> {
if self.binary && self.detect_binary(buf, range) {
if self.binary && self.detect_binary(buf, range)? {
return Ok(false);
}
self.count_lines(buf, range.start());

View File

@@ -51,6 +51,7 @@ where M: Matcher,
fn fill(&mut self) -> Result<bool, S::Error> {
assert!(self.rdr.buffer()[self.core.pos()..].is_empty());
let already_binary = self.rdr.binary_byte_offset().is_some();
let old_buf_len = self.rdr.buffer().len();
let consumed = self.core.roll(self.rdr.buffer());
self.rdr.consume(consumed);
@@ -58,7 +59,14 @@ where M: Matcher,
Err(err) => return Err(S::Error::error_io(err)),
Ok(didread) => didread,
};
if !didread || self.rdr.binary_byte_offset().is_some() {
if !already_binary {
if let Some(offset) = self.rdr.binary_byte_offset() {
if !self.core.binary_data(offset)? {
return Ok(false);
}
}
}
if !didread || self.should_binary_quit() {
return Ok(false);
}
// If rolling the buffer didn't result in consuming anything and if
@@ -71,6 +79,11 @@ where M: Matcher,
}
Ok(true)
}
fn should_binary_quit(&self) -> bool {
self.rdr.binary_byte_offset().is_some()
&& self.config.binary.quit_byte().is_some()
}
}
#[derive(Debug)]
@@ -103,7 +116,7 @@ impl<'s, M: Matcher, S: Sink> SliceByLine<'s, M, S> {
DEFAULT_BUFFER_CAPACITY,
);
let binary_range = Range::new(0, binary_upto);
if !self.core.detect_binary(self.slice, &binary_range) {
if !self.core.detect_binary(self.slice, &binary_range)? {
while
!self.slice[self.core.pos()..].is_empty()
&& self.core.match_by_line(self.slice)?
@@ -155,7 +168,7 @@ impl<'s, M: Matcher, S: Sink> MultiLine<'s, M, S> {
DEFAULT_BUFFER_CAPACITY,
);
let binary_range = Range::new(0, binary_upto);
if !self.core.detect_binary(self.slice, &binary_range) {
if !self.core.detect_binary(self.slice, &binary_range)? {
let mut keepgoing = true;
while !self.slice[self.core.pos()..].is_empty() && keepgoing {
keepgoing = self.sink()?;

View File

@@ -75,25 +75,41 @@ impl BinaryDetection {
BinaryDetection(line_buffer::BinaryDetection::Quit(binary_byte))
}
// TODO(burntsushi): Figure out how to make binary conversion work. This
// permits implementing GNU grep's default behavior, which is to zap NUL
// bytes but still execute a search (if a match is detected, then GNU grep
// stops and reports that a match was found but doesn't print the matching
// line itself).
//
// This behavior is pretty simple to implement using the line buffer (and
// in fact, it is already implemented and tested), since there's a fixed
// size buffer that we can easily write to. The issue arises when searching
// a `&[u8]` (whether on the heap or via a memory map), since this isn't
// something we can easily write to.
/// The given byte is searched in all contents read by the line buffer. If
/// it occurs, then it is replaced by the line terminator. The line buffer
/// guarantees that this byte will never be observable by callers.
#[allow(dead_code)]
fn convert(binary_byte: u8) -> BinaryDetection {
/// Binary detection is performed by looking for the given byte, and
/// replacing it with the line terminator configured on the searcher.
/// (If the searcher is configured to use `CRLF` as the line terminator,
/// then this byte is replaced by just `LF`.)
///
/// When searching is performed using a fixed size buffer, then the
/// contents of that buffer are always searched for the presence of this
/// byte and replaced with the line terminator. In effect, the caller is
/// guaranteed to never observe this byte while searching.
///
/// When searching is performed with the entire contents mapped into
/// memory, then this setting has no effect and is ignored.
pub fn convert(binary_byte: u8) -> BinaryDetection {
BinaryDetection(line_buffer::BinaryDetection::Convert(binary_byte))
}
/// If this binary detection uses the "quit" strategy, then this returns
/// the byte that will cause a search to quit. In any other case, this
/// returns `None`.
pub fn quit_byte(&self) -> Option<u8> {
match self.0 {
line_buffer::BinaryDetection::Quit(b) => Some(b),
_ => None,
}
}
/// If this binary detection uses the "convert" strategy, then this returns
/// the byte that will be replaced by the line terminator. In any other
/// case, this returns `None`.
pub fn convert_byte(&self) -> Option<u8> {
match self.0 {
line_buffer::BinaryDetection::Convert(b) => Some(b),
_ => None,
}
}
}
/// An encoding to use when searching.
@@ -155,6 +171,8 @@ pub struct Config {
/// An encoding that, when present, causes the searcher to transcode all
/// input from the encoding to UTF-8.
encoding: Option<Encoding>,
/// Whether to do automatic transcoding based on a BOM or not.
bom_sniffing: bool,
}
impl Default for Config {
@@ -171,6 +189,7 @@ impl Default for Config {
binary: BinaryDetection::default(),
multi_line: false,
encoding: None,
bom_sniffing: true,
}
}
}
@@ -303,11 +322,15 @@ impl SearcherBuilder {
config.before_context = 0;
config.after_context = 0;
}
let mut decode_builder = DecodeReaderBytesBuilder::new();
decode_builder
.encoding(self.config.encoding.as_ref().map(|e| e.0))
.utf8_passthru(true)
.bom_override(true);
.strip_bom(self.config.bom_sniffing)
.bom_override(true)
.bom_sniffing(self.config.bom_sniffing);
Searcher {
config: config,
decode_builder: decode_builder,
@@ -505,12 +528,13 @@ impl SearcherBuilder {
/// transcoding process encounters an error, then bytes are replaced with
/// the Unicode replacement codepoint.
///
/// When no encoding is specified (the default), then BOM sniffing is used
/// to determine whether the source data is UTF-8 or UTF-16, and
/// transcoding will be performed automatically. If no BOM could be found,
/// then the source data is searched _as if_ it were UTF-8. However, so
/// long as the source data is at least ASCII compatible, then it is
/// possible for a search to produce useful results.
/// When no encoding is specified (the default), then BOM sniffing is
/// used (if it's enabled, which it is, by default) to determine whether
/// the source data is UTF-8 or UTF-16, and transcoding will be performed
/// automatically. If no BOM could be found, then the source data is
/// searched _as if_ it were UTF-8. However, so long as the source data is
/// at least ASCII compatible, then it is possible for a search to produce
/// useful results.
pub fn encoding(
&mut self,
encoding: Option<Encoding>,
@@ -518,6 +542,23 @@ impl SearcherBuilder {
self.config.encoding = encoding;
self
}
/// Enable automatic transcoding based on BOM sniffing.
///
/// When this is enabled and an explicit encoding is not set, then this
/// searcher will try to detect the encoding of the bytes being searched
/// by sniffing its byte-order mark (BOM). In particular, when this is
/// enabled, UTF-16 encoded files will be searched seamlessly.
///
/// When this is disabled and if an explicit encoding is not set, then
/// the bytes from the source stream will be passed through unchanged,
/// including its BOM, if one is present.
///
/// This is enabled by default.
pub fn bom_sniffing(&mut self, yes: bool) -> &mut SearcherBuilder {
self.config.bom_sniffing = yes;
self
}
}
/// A searcher executes searches over a haystack and writes results to a caller
@@ -714,6 +755,12 @@ impl Searcher {
}
}
/// Set the binary detection method used on this searcher.
pub fn set_binary_detection(&mut self, detection: BinaryDetection) {
self.config.binary = detection.clone();
self.line_buffer.borrow_mut().set_binary_detection(detection.0);
}
/// Check that the searcher's configuration and the matcher are consistent
/// with each other.
fn check_config<M: Matcher>(&self, matcher: M) -> Result<(), ConfigError> {
@@ -737,7 +784,8 @@ impl Searcher {
/// Returns true if and only if the given slice needs to be transcoded.
fn slice_needs_transcoding(&self, slice: &[u8]) -> bool {
self.config.encoding.is_some() || slice_has_utf16_bom(slice)
self.config.encoding.is_some()
|| (self.config.bom_sniffing && slice_has_utf16_bom(slice))
}
}
@@ -752,6 +800,12 @@ impl Searcher {
self.config.line_term
}
/// Returns the type of binary detection configured on this searcher.
#[inline]
pub fn binary_detection(&self) -> &BinaryDetection {
&self.config.binary
}
/// Returns true if and only if this searcher is configured to invert its
/// search results. That is, matching lines are lines that do **not** match
/// the searcher's matcher.

View File

@@ -1,3 +1,4 @@
use std::error;
use std::fmt;
use std::io;
@@ -49,9 +50,9 @@ impl SinkError for io::Error {
/// A `Box<std::error::Error>` can be used as an error for `Sink`
/// implementations out of the box.
impl SinkError for Box<::std::error::Error> {
fn error_message<T: fmt::Display>(message: T) -> Box<::std::error::Error> {
Box::<::std::error::Error>::from(message.to_string())
impl SinkError for Box<dyn error::Error> {
fn error_message<T: fmt::Display>(message: T) -> Box<dyn error::Error> {
Box::<dyn error::Error>::from(message.to_string())
}
}
@@ -167,6 +168,28 @@ pub trait Sink {
Ok(true)
}
/// This method is called whenever binary detection is enabled and binary
/// data is found. If binary data is found, then this is called at least
/// once for the first occurrence with the absolute byte offset at which
/// the binary data begins.
///
/// If this returns `true`, then searching continues. If this returns
/// `false`, then searching is stopped immediately and `finish` is called.
///
/// If this returns an error, then searching is stopped immediately,
/// `finish` is not called and the error is bubbled back up to the caller
/// of the searcher.
///
/// By default, it does nothing and returns `true`.
#[inline]
fn binary_data(
&mut self,
_searcher: &Searcher,
_binary_byte_offset: u64,
) -> Result<bool, Self::Error> {
Ok(true)
}
/// This method is called when a search has begun, before any search is
/// executed. By default, this does nothing.
///
@@ -228,6 +251,71 @@ impl<'a, S: Sink> Sink for &'a mut S {
(**self).context_break(searcher)
}
#[inline]
fn binary_data(
&mut self,
searcher: &Searcher,
binary_byte_offset: u64,
) -> Result<bool, S::Error> {
(**self).binary_data(searcher, binary_byte_offset)
}
#[inline]
fn begin(
&mut self,
searcher: &Searcher,
) -> Result<bool, S::Error> {
(**self).begin(searcher)
}
#[inline]
fn finish(
&mut self,
searcher: &Searcher,
sink_finish: &SinkFinish,
) -> Result<(), S::Error> {
(**self).finish(searcher, sink_finish)
}
}
impl<S: Sink + ?Sized> Sink for Box<S> {
type Error = S::Error;
#[inline]
fn matched(
&mut self,
searcher: &Searcher,
mat: &SinkMatch,
) -> Result<bool, S::Error> {
(**self).matched(searcher, mat)
}
#[inline]
fn context(
&mut self,
searcher: &Searcher,
context: &SinkContext,
) -> Result<bool, S::Error> {
(**self).context(searcher, context)
}
#[inline]
fn context_break(
&mut self,
searcher: &Searcher,
) -> Result<bool, S::Error> {
(**self).context_break(searcher)
}
#[inline]
fn binary_data(
&mut self,
searcher: &Searcher,
binary_byte_offset: u64,
) -> Result<bool, S::Error> {
(**self).binary_data(searcher, binary_byte_offset)
}
#[inline]
fn begin(
&mut self,

View File

@@ -1,10 +1,10 @@
use std::io::{self, Write};
use std::str;
use bstr::ByteSlice;
use grep_matcher::{
LineMatchKind, LineTerminator, Match, Matcher, NoCaptures, NoError,
};
use memchr::memchr;
use regex::bytes::{Regex, RegexBuilder};
use searcher::{BinaryDetection, Searcher, SearcherBuilder};
@@ -94,7 +94,8 @@ impl Matcher for RegexMatcher {
}
// Make it interesting and return the last byte in the current
// line.
let i = memchr(self.line_term.unwrap().as_byte(), haystack)
let i = haystack
.find_byte(self.line_term.unwrap().as_byte())
.map(|i| i)
.unwrap_or(haystack.len() - 1);
Ok(Some(LineMatchKind::Candidate(i)))

View File

@@ -1,6 +1,6 @@
[package]
name = "grep"
version = "0.2.0" #:version
version = "0.2.4" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
Fast line oriented regex searching as a library.
@@ -13,18 +13,20 @@ keywords = ["regex", "grep", "egrep", "search", "pattern"]
license = "Unlicense/MIT"
[dependencies]
grep-matcher = { version = "0.1.0", path = "../grep-matcher" }
grep-pcre2 = { version = "0.1.0", path = "../grep-pcre2", optional = true }
grep-printer = { version = "0.1.0", path = "../grep-printer" }
grep-regex = { version = "0.1.0", path = "../grep-regex" }
grep-searcher = { version = "0.1.0", path = "../grep-searcher" }
grep-cli = { version = "0.1.2", path = "../grep-cli" }
grep-matcher = { version = "0.1.2", path = "../grep-matcher" }
grep-pcre2 = { version = "0.1.3", path = "../grep-pcre2", optional = true }
grep-printer = { version = "0.1.2", path = "../grep-printer" }
grep-regex = { version = "0.1.3", path = "../grep-regex" }
grep-searcher = { version = "0.1.4", path = "../grep-searcher" }
[dev-dependencies]
atty = "0.2.11"
termcolor = "1"
walkdir = "2.2.0"
termcolor = "1.0.4"
walkdir = "2.2.7"
[features]
avx-accel = ["grep-searcher/avx-accel"]
simd-accel = ["grep-searcher/simd-accel"]
pcre2 = ["grep-pcre2"]
# This feature is DEPRECATED. Runtime dispatch is used for SIMD now.
avx-accel = []

View File

@@ -1,29 +1,19 @@
extern crate atty;
extern crate grep;
extern crate termcolor;
extern crate walkdir;
use std::env;
use std::error;
use std::error::Error;
use std::ffi::OsString;
use std::path::Path;
use std::process;
use std::result;
use grep::cli;
use grep::printer::{ColorSpecs, StandardBuilder};
use grep::regex::RegexMatcher;
use grep::searcher::{BinaryDetection, SearcherBuilder};
use termcolor::{ColorChoice, StandardStream};
use termcolor::ColorChoice;
use walkdir::WalkDir;
macro_rules! fail {
($($tt:tt)*) => {
return Err(From::from(format!($($tt)*)));
}
}
type Result<T> = result::Result<T, Box<error::Error>>;
fn main() {
if let Err(err) = try_main() {
eprintln!("{}", err);
@@ -31,45 +21,39 @@ fn main() {
}
}
fn try_main() -> Result<()> {
fn try_main() -> Result<(), Box<dyn Error>> {
let mut args: Vec<OsString> = env::args_os().collect();
if args.len() < 2 {
fail!("Usage: simplegrep <pattern> [<path> ...]");
return Err("Usage: simplegrep <pattern> [<path> ...]".into());
}
if args.len() == 2 {
args.push(OsString::from("./"));
}
let pattern = match args[1].clone().into_string() {
Ok(pattern) => pattern,
Err(_) => {
fail!(
"pattern is not valid UTF-8: '{:?}'",
args[1].to_string_lossy()
);
}
};
search(&pattern, &args[2..])
search(cli::pattern_from_os(&args[1])?, &args[2..])
}
fn search(pattern: &str, paths: &[OsString]) -> Result<()> {
fn search(pattern: &str, paths: &[OsString]) -> Result<(), Box<dyn Error>> {
let matcher = RegexMatcher::new_line_matcher(&pattern)?;
let mut searcher = SearcherBuilder::new()
.binary_detection(BinaryDetection::quit(b'\x00'))
.line_number(false)
.build();
let mut printer = StandardBuilder::new()
.color_specs(colors())
.build(StandardStream::stdout(color_choice()));
.color_specs(ColorSpecs::default_with_color())
.build(cli::stdout(
if cli::is_tty_stdout() {
ColorChoice::Auto
} else {
ColorChoice::Never
}
));
for path in paths {
for result in WalkDir::new(path) {
let dent = match result {
Ok(dent) => dent,
Err(err) => {
eprintln!(
"{}: {}",
err.path().unwrap_or(Path::new("error")).display(),
err,
);
eprintln!("{}", err);
continue;
}
};
@@ -88,20 +72,3 @@ fn search(pattern: &str, paths: &[OsString]) -> Result<()> {
}
Ok(())
}
fn color_choice() -> ColorChoice {
if atty::is(atty::Stream::Stdout) {
ColorChoice::Auto
} else {
ColorChoice::Never
}
}
fn colors() -> ColorSpecs {
ColorSpecs::new(&[
"path:fg:magenta".parse().unwrap(),
"line:fg:green".parse().unwrap(),
"match:fg:red".parse().unwrap(),
"match:style:bold".parse().unwrap(),
])
}

View File

@@ -14,6 +14,7 @@ A cookbook and a guide are planned.
#![deny(missing_docs)]
pub extern crate grep_cli as cli;
pub extern crate grep_matcher as matcher;
#[cfg(feature = "pcre2")]
pub extern crate grep_pcre2 as pcre2;

View File

@@ -1,6 +1,6 @@
[package]
name = "ignore"
version = "0.4.3" #:version
version = "0.4.7" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
A fast library for efficiently matching ignore files such as `.gitignore`
@@ -18,22 +18,21 @@ name = "ignore"
bench = false
[dependencies]
crossbeam = "0.3"
globset = { version = "0.4.0", path = "../globset" }
lazy_static = "1"
log = "0.4"
memchr = "2"
regex = "1"
same-file = "1"
thread_local = "0.3.2"
walkdir = "2.2.0"
crossbeam-channel = "0.3.6"
globset = { version = "0.4.3", path = "../globset" }
lazy_static = "1.1"
log = "0.4.5"
memchr = "2.1"
regex = "1.1"
same-file = "1.0.4"
thread_local = "0.3.6"
walkdir = "2.2.7"
[target.'cfg(windows)'.dependencies.winapi]
version = "0.3"
features = ["std", "winnt"]
[target.'cfg(windows)'.dependencies.winapi-util]
version = "0.1.2"
[dev-dependencies]
tempdir = "0.3.5"
tempfile = "3.0.5"
[features]
simd-accel = ["globset/simd-accel"]

View File

@@ -1,14 +1,12 @@
extern crate crossbeam;
extern crate crossbeam_channel as channel;
extern crate ignore;
extern crate walkdir;
use std::env;
use std::io::{self, Write};
use std::path::Path;
use std::sync::Arc;
use std::thread;
use crossbeam::sync::MsQueue;
use ignore::WalkBuilder;
use walkdir::WalkDir;
@@ -16,7 +14,7 @@ fn main() {
let mut path = env::args().nth(1).unwrap();
let mut parallel = false;
let mut simple = false;
let queue: Arc<MsQueue<Option<DirEntry>>> = Arc::new(MsQueue::new());
let (tx, rx) = channel::bounded::<DirEntry>(100);
if path == "parallel" {
path = env::args().nth(2).unwrap();
parallel = true;
@@ -25,10 +23,9 @@ fn main() {
simple = true;
}
let stdout_queue = queue.clone();
let stdout_thread = thread::spawn(move || {
let mut stdout = io::BufWriter::new(io::stdout());
while let Some(dent) = stdout_queue.pop() {
for dent in rx {
write_path(&mut stdout, dent.path());
}
});
@@ -36,26 +33,26 @@ fn main() {
if parallel {
let walker = WalkBuilder::new(path).threads(6).build_parallel();
walker.run(|| {
let queue = queue.clone();
let tx = tx.clone();
Box::new(move |result| {
use ignore::WalkState::*;
queue.push(Some(DirEntry::Y(result.unwrap())));
tx.send(DirEntry::Y(result.unwrap())).unwrap();
Continue
})
});
} else if simple {
let walker = WalkDir::new(path);
for result in walker {
queue.push(Some(DirEntry::X(result.unwrap())));
tx.send(DirEntry::X(result.unwrap())).unwrap();
}
} else {
let walker = WalkBuilder::new(path).build();
for result in walker {
queue.push(Some(DirEntry::Y(result.unwrap())));
tx.send(DirEntry::Y(result.unwrap())).unwrap();
}
}
queue.push(None);
drop(tx);
stdout_thread.join().unwrap();
}

View File

@@ -22,6 +22,7 @@ use gitignore::{self, Gitignore, GitignoreBuilder};
use pathutil::{is_hidden, strip_prefix};
use overrides::{self, Override};
use types::{self, Types};
use walk::DirEntry;
use {Error, Match, PartialErrorBuilder};
/// IgnoreMatch represents information about where a match came from when using
@@ -73,6 +74,8 @@ struct IgnoreOptions {
git_ignore: bool,
/// Whether to read .git/info/exclude files.
git_exclude: bool,
/// Whether to ignore files case insensitively
ignore_case_insensitive: bool,
}
/// Ignore is a matcher useful for recursively walking one or more directories.
@@ -194,7 +197,12 @@ impl Ignore {
errs.maybe_push(err);
igtmp.is_absolute_parent = true;
igtmp.absolute_base = Some(absolute_base.clone());
igtmp.has_git = parent.join(".git").exists();
igtmp.has_git =
if self.0.opts.git_ignore {
parent.join(".git").exists()
} else {
false
};
ig = Ignore(Arc::new(igtmp));
compiled.insert(parent.as_os_str().to_os_string(), ig.clone());
}
@@ -225,7 +233,11 @@ impl Ignore {
Gitignore::empty()
} else {
let (m, err) =
create_gitignore(&dir, &self.0.custom_ignore_filenames);
create_gitignore(
&dir,
&self.0.custom_ignore_filenames,
self.0.opts.ignore_case_insensitive,
);
errs.maybe_push(err);
m
};
@@ -233,7 +245,12 @@ impl Ignore {
if !self.0.opts.ignore {
Gitignore::empty()
} else {
let (m, err) = create_gitignore(&dir, &[".ignore"]);
let (m, err) =
create_gitignore(
&dir,
&[".ignore"],
self.0.opts.ignore_case_insensitive,
);
errs.maybe_push(err);
m
};
@@ -241,7 +258,12 @@ impl Ignore {
if !self.0.opts.git_ignore {
Gitignore::empty()
} else {
let (m, err) = create_gitignore(&dir, &[".gitignore"]);
let (m, err) =
create_gitignore(
&dir,
&[".gitignore"],
self.0.opts.ignore_case_insensitive,
);
errs.maybe_push(err);
m
};
@@ -249,10 +271,21 @@ impl Ignore {
if !self.0.opts.git_exclude {
Gitignore::empty()
} else {
let (m, err) = create_gitignore(&dir, &[".git/info/exclude"]);
let (m, err) =
create_gitignore(
&dir,
&[".git/info/exclude"],
self.0.opts.ignore_case_insensitive,
);
errs.maybe_push(err);
m
};
let has_git =
if self.0.opts.git_ignore {
dir.join(".git").exists()
} else {
false
};
let ig = IgnoreInner {
compiled: self.0.compiled.clone(),
dir: dir.to_path_buf(),
@@ -268,7 +301,7 @@ impl Ignore {
git_global_matcher: self.0.git_global_matcher.clone(),
git_ignore_matcher: gi_matcher,
git_exclude_matcher: gi_exclude_matcher,
has_git: dir.join(".git").exists(),
has_git: has_git,
opts: self.0.opts,
};
(ig, errs.into_error_option())
@@ -285,11 +318,23 @@ impl Ignore {
|| has_explicit_ignores
}
/// Like `matched`, but works with a directory entry instead.
pub fn matched_dir_entry<'a>(
&'a self,
dent: &DirEntry,
) -> Match<IgnoreMatch<'a>> {
let m = self.matched(dent.path(), dent.is_dir());
if m.is_none() && self.0.opts.hidden && is_hidden(dent) {
return Match::Ignore(IgnoreMatch::hidden());
}
m
}
/// Returns a match indicating whether the given file path should be
/// ignored or not.
///
/// The match contains information about its origin.
pub fn matched<'a, P: AsRef<Path>>(
fn matched<'a, P: AsRef<Path>>(
&'a self,
path: P,
is_dir: bool,
@@ -330,9 +375,6 @@ impl Ignore {
whitelisted = mat;
}
}
if whitelisted.is_none() && self.0.opts.hidden && is_hidden(path) {
return Match::Ignore(IgnoreMatch::hidden());
}
whitelisted
}
@@ -483,6 +525,7 @@ impl IgnoreBuilder {
git_global: true,
git_ignore: true,
git_exclude: true,
ignore_case_insensitive: false,
},
}
}
@@ -496,7 +539,11 @@ impl IgnoreBuilder {
if !self.opts.git_global {
Gitignore::empty()
} else {
let (gi, err) = Gitignore::global();
let mut builder = GitignoreBuilder::new("");
builder
.case_insensitive(self.opts.ignore_case_insensitive)
.unwrap();
let (gi, err) = builder.build_global();
if let Some(err) = err {
debug!("{}", err);
}
@@ -627,6 +674,17 @@ impl IgnoreBuilder {
self.opts.git_exclude = yes;
self
}
/// Process ignore files case insensitively
///
/// This is disabled by default.
pub fn ignore_case_insensitive(
&mut self,
yes: bool,
) -> &mut IgnoreBuilder {
self.opts.ignore_case_insensitive = yes;
self
}
}
/// Creates a new gitignore matcher for the directory given.
@@ -638,9 +696,11 @@ impl IgnoreBuilder {
pub fn create_gitignore<T: AsRef<OsStr>>(
dir: &Path,
names: &[T],
case_insensitive: bool,
) -> (Gitignore, Option<Error>) {
let mut builder = GitignoreBuilder::new(dir);
let mut errs = PartialErrorBuilder::default();
builder.case_insensitive(case_insensitive).unwrap();
for name in names {
let gipath = dir.join(name.as_ref());
errs.maybe_push_ignore_io(builder.add(gipath));
@@ -661,7 +721,7 @@ mod tests {
use std::io::Write;
use std::path::Path;
use tempdir::TempDir;
use tempfile::{self, TempDir};
use dir::IgnoreBuilder;
use gitignore::Gitignore;
@@ -683,9 +743,13 @@ mod tests {
}
}
fn tmpdir(prefix: &str) -> TempDir {
tempfile::Builder::new().prefix(prefix).tempdir().unwrap()
}
#[test]
fn explicit_ignore() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
wfile(td.path().join("not-an-ignore"), "foo\n!bar");
let (gi, err) = Gitignore::new(td.path().join("not-an-ignore"));
@@ -700,7 +764,7 @@ mod tests {
#[test]
fn git_exclude() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
mkdirp(td.path().join(".git/info"));
wfile(td.path().join(".git/info/exclude"), "foo\n!bar");
@@ -713,7 +777,7 @@ mod tests {
#[test]
fn gitignore() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
mkdirp(td.path().join(".git"));
wfile(td.path().join(".gitignore"), "foo\n!bar");
@@ -726,7 +790,7 @@ mod tests {
#[test]
fn gitignore_no_git() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
wfile(td.path().join(".gitignore"), "foo\n!bar");
let (ig, err) = IgnoreBuilder::new().build().add_child(td.path());
@@ -738,7 +802,7 @@ mod tests {
#[test]
fn ignore() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
wfile(td.path().join(".ignore"), "foo\n!bar");
let (ig, err) = IgnoreBuilder::new().build().add_child(td.path());
@@ -750,7 +814,7 @@ mod tests {
#[test]
fn custom_ignore() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
let custom_ignore = ".customignore";
wfile(td.path().join(custom_ignore), "foo\n!bar");
@@ -766,7 +830,7 @@ mod tests {
// Tests that a custom ignore file will override an .ignore.
#[test]
fn custom_ignore_over_ignore() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
let custom_ignore = ".customignore";
wfile(td.path().join(".ignore"), "foo");
wfile(td.path().join(custom_ignore), "!foo");
@@ -781,7 +845,7 @@ mod tests {
// Tests that earlier custom ignore files have lower precedence than later.
#[test]
fn custom_ignore_precedence() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
let custom_ignore1 = ".customignore1";
let custom_ignore2 = ".customignore2";
wfile(td.path().join(custom_ignore1), "foo");
@@ -798,7 +862,7 @@ mod tests {
// Tests that an .ignore will override a .gitignore.
#[test]
fn ignore_over_gitignore() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
wfile(td.path().join(".gitignore"), "foo");
wfile(td.path().join(".ignore"), "!foo");
@@ -810,7 +874,7 @@ mod tests {
// Tests that exclude has lower precedent than both .ignore and .gitignore.
#[test]
fn exclude_lowest() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
wfile(td.path().join(".gitignore"), "!foo");
wfile(td.path().join(".ignore"), "!bar");
mkdirp(td.path().join(".git/info"));
@@ -825,8 +889,8 @@ mod tests {
#[test]
fn errored() {
let td = TempDir::new("ignore-test-").unwrap();
wfile(td.path().join(".gitignore"), "f**oo");
let td = tmpdir("ignore-test-");
wfile(td.path().join(".gitignore"), "{foo");
let (_, err) = IgnoreBuilder::new().build().add_child(td.path());
assert!(err.is_some());
@@ -834,9 +898,9 @@ mod tests {
#[test]
fn errored_both() {
let td = TempDir::new("ignore-test-").unwrap();
wfile(td.path().join(".gitignore"), "f**oo");
wfile(td.path().join(".ignore"), "fo**o");
let td = tmpdir("ignore-test-");
wfile(td.path().join(".gitignore"), "{foo");
wfile(td.path().join(".ignore"), "{bar");
let (_, err) = IgnoreBuilder::new().build().add_child(td.path());
assert_eq!(2, partial(err.expect("an error")).len());
@@ -844,9 +908,9 @@ mod tests {
#[test]
fn errored_partial() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
mkdirp(td.path().join(".git"));
wfile(td.path().join(".gitignore"), "f**oo\nbar");
wfile(td.path().join(".gitignore"), "{foo\nbar");
let (ig, err) = IgnoreBuilder::new().build().add_child(td.path());
assert!(err.is_some());
@@ -855,8 +919,8 @@ mod tests {
#[test]
fn errored_partial_and_ignore() {
let td = TempDir::new("ignore-test-").unwrap();
wfile(td.path().join(".gitignore"), "f**oo\nbar");
let td = tmpdir("ignore-test-");
wfile(td.path().join(".gitignore"), "{foo\nbar");
wfile(td.path().join(".ignore"), "!bar");
let (ig, err) = IgnoreBuilder::new().build().add_child(td.path());
@@ -866,7 +930,7 @@ mod tests {
#[test]
fn not_present_empty() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
let (_, err) = IgnoreBuilder::new().build().add_child(td.path());
assert!(err.is_none());
@@ -876,7 +940,7 @@ mod tests {
fn stops_at_git_dir() {
// This tests that .gitignore files beyond a .git barrier aren't
// matched, but .ignore files are.
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
mkdirp(td.path().join(".git"));
mkdirp(td.path().join("foo/.git"));
wfile(td.path().join(".gitignore"), "foo");
@@ -897,7 +961,7 @@ mod tests {
#[test]
fn absolute_parent() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
mkdirp(td.path().join(".git"));
mkdirp(td.path().join("foo"));
wfile(td.path().join(".gitignore"), "bar");
@@ -920,7 +984,7 @@ mod tests {
#[test]
fn absolute_parent_anchored() {
let td = TempDir::new("ignore-test-").unwrap();
let td = tmpdir("ignore-test-");
mkdirp(td.path().join(".git"));
mkdirp(td.path().join("src/llvm"));
wfile(td.path().join(".gitignore"), "/llvm/\nfoo");

View File

@@ -69,8 +69,7 @@ impl Glob {
/// Returns true if and only if this glob has a `**/` prefix.
fn has_doublestar_prefix(&self) -> bool {
self.actual.starts_with("**/")
|| (self.actual == "**" && self.is_only_dir)
self.actual.starts_with("**/") || self.actual == "**"
}
}
@@ -127,16 +126,7 @@ impl Gitignore {
/// `$XDG_CONFIG_HOME/git/ignore` is read. If `$XDG_CONFIG_HOME` is not
/// set or is empty, then `$HOME/.config/git/ignore` is used instead.
pub fn global() -> (Gitignore, Option<Error>) {
match gitconfig_excludes_path() {
None => (Gitignore::empty(), None),
Some(path) => {
if !path.is_file() {
(Gitignore::empty(), None)
} else {
Gitignore::new(path)
}
}
}
GitignoreBuilder::new("").build_global()
}
/// Creates a new empty gitignore matcher that never matches anything.
@@ -359,6 +349,36 @@ impl GitignoreBuilder {
})
}
/// Build a global gitignore matcher using the configuration in this
/// builder.
///
/// This consumes ownership of the builder unlike `build` because it
/// must mutate the builder to add the global gitignore globs.
///
/// Note that this ignores the path given to this builder's constructor
/// and instead derives the path automatically from git's global
/// configuration.
pub fn build_global(mut self) -> (Gitignore, Option<Error>) {
match gitconfig_excludes_path() {
None => (Gitignore::empty(), None),
Some(path) => {
if !path.is_file() {
(Gitignore::empty(), None)
} else {
let mut errs = PartialErrorBuilder::default();
errs.maybe_push_ignore_io(self.add(path));
match self.build() {
Ok(gi) => (gi, errs.into_error_option()),
Err(err) => {
errs.push(err);
(Gitignore::empty(), errs.into_error_option())
}
}
}
}
}
}
/// Add each glob from the file path given.
///
/// The file given should be formatted as a `gitignore` file.
@@ -419,6 +439,8 @@ impl GitignoreBuilder {
from: Option<PathBuf>,
mut line: &str,
) -> Result<&mut GitignoreBuilder, Error> {
#![allow(deprecated)]
if line.starts_with("#") {
return Ok(self);
}
@@ -435,7 +457,6 @@ impl GitignoreBuilder {
is_whitelist: false,
is_only_dir: false,
};
let mut literal_separator = false;
let mut is_absolute = false;
if line.starts_with("\\!") || line.starts_with("\\#") {
line = &line[1..];
@@ -450,7 +471,6 @@ impl GitignoreBuilder {
// then the glob can only match the beginning of a path
// (relative to the location of gitignore). We achieve this by
// simply banning wildcards from matching /.
literal_separator = true;
line = &line[1..];
is_absolute = true;
}
@@ -463,16 +483,11 @@ impl GitignoreBuilder {
line = &line[..i];
}
}
// If there is a literal slash, then we note that so that globbing
// doesn't let wildcards match slashes.
glob.actual = line.to_string();
if is_absolute || line.chars().any(|c| c == '/') {
literal_separator = true;
}
// If there was a slash, then this is a glob that must match the entire
// path name. Otherwise, we should let it match anywhere, so use a **/
// prefix.
if !literal_separator {
// If there is a literal slash, then this is a glob that must match the
// entire path name. Otherwise, we should let it match anywhere, so use
// a **/ prefix.
if !is_absolute && !line.chars().any(|c| c == '/') {
// ... but only if we don't already have a **/ prefix.
if !glob.has_doublestar_prefix() {
glob.actual = format!("**/{}", glob.actual);
@@ -486,7 +501,7 @@ impl GitignoreBuilder {
}
let parsed =
GlobBuilder::new(&glob.actual)
.literal_separator(literal_separator)
.literal_separator(true)
.case_insensitive(self.case_insensitive)
.backslash_escape(true)
.build()
@@ -503,12 +518,16 @@ impl GitignoreBuilder {
/// Toggle whether the globs should be matched case insensitively or not.
///
/// When this option is changed, only globs added after the change will be affected.
/// When this option is changed, only globs added after the change will be
/// affected.
///
/// This is disabled by default.
pub fn case_insensitive(
&mut self, yes: bool
&mut self,
yes: bool,
) -> Result<&mut GitignoreBuilder, Error> {
// TODO: This should not return a `Result`. Fix this in the next semver
// release.
self.case_insensitive = yes;
Ok(self)
}
@@ -689,6 +708,9 @@ mod tests {
ignored!(ig39, ROOT, "\\?", "?");
ignored!(ig40, ROOT, "\\*", "*");
ignored!(ig41, ROOT, "\\a", "a");
ignored!(ig42, ROOT, "s*.rs", "sfoo.rs");
ignored!(ig43, ROOT, "**", "foo.rs");
ignored!(ig44, ROOT, "**/**/*", "a/foo.rs");
not_ignored!(ignot1, ROOT, "amonths", "months");
not_ignored!(ignot2, ROOT, "monthsa", "months");
@@ -710,6 +732,7 @@ mod tests {
not_ignored!(ignot16, ROOT, "*\n!**/", "foo", true);
not_ignored!(ignot17, ROOT, "src/*.rs", "src/grep/src/main.rs");
not_ignored!(ignot18, ROOT, "path1/*", "path2/path1/foo");
not_ignored!(ignot19, ROOT, "s*.rs", "src/foo.rs");
fn bytes(s: &str) -> Vec<u8> {
s.to_string().into_bytes()

View File

@@ -46,7 +46,7 @@ See the documentation for `WalkBuilder` for many other options.
#![deny(missing_docs)]
extern crate crossbeam;
extern crate crossbeam_channel as channel;
extern crate globset;
#[macro_use]
extern crate lazy_static;
@@ -56,11 +56,11 @@ extern crate memchr;
extern crate regex;
extern crate same_file;
#[cfg(test)]
extern crate tempdir;
extern crate tempfile;
extern crate thread_local;
extern crate walkdir;
#[cfg(windows)]
extern crate winapi;
extern crate winapi_util;
use std::error;
use std::fmt;

View File

@@ -139,13 +139,16 @@ impl OverrideBuilder {
}
/// Toggle whether the globs should be matched case insensitively or not.
///
///
/// When this option is changed, only globs added after the change will be affected.
///
/// This is disabled by default.
pub fn case_insensitive(
&mut self, yes: bool
&mut self,
yes: bool,
) -> Result<&mut OverrideBuilder, Error> {
// TODO: This should not return a `Result`. Fix this in the next semver
// release.
self.builder.case_insensitive(yes)?;
Ok(self)
}

View File

@@ -1,22 +1,56 @@
use std::ffi::OsStr;
use std::path::Path;
/// Returns true if and only if this file path is considered to be hidden.
use walk::DirEntry;
/// Returns true if and only if this entry is considered to be hidden.
///
/// This only returns true if the base name of the path starts with a `.`.
///
/// On Unix, this implements a more optimized check.
#[cfg(unix)]
pub fn is_hidden<P: AsRef<Path>>(path: P) -> bool {
pub fn is_hidden(dent: &DirEntry) -> bool {
use std::os::unix::ffi::OsStrExt;
if let Some(name) = file_name(path.as_ref()) {
if let Some(name) = file_name(dent.path()) {
name.as_bytes().get(0) == Some(&b'.')
} else {
false
}
}
/// Returns true if and only if this file path is considered to be hidden.
#[cfg(not(unix))]
pub fn is_hidden<P: AsRef<Path>>(path: P) -> bool {
if let Some(name) = file_name(path.as_ref()) {
/// Returns true if and only if this entry is considered to be hidden.
///
/// On Windows, this returns true if one of the following is true:
///
/// * The base name of the path starts with a `.`.
/// * The file attributes have the `HIDDEN` property set.
#[cfg(windows)]
pub fn is_hidden(dent: &DirEntry) -> bool {
use std::os::windows::fs::MetadataExt;
use winapi_util::file;
// This looks like we're doing an extra stat call, but on Windows, the
// directory traverser reuses the metadata retrieved from each directory
// entry and stores it on the DirEntry itself. So this is "free."
if let Ok(md) = dent.metadata() {
if file::is_hidden(md.file_attributes() as u64) {
return true;
}
}
if let Some(name) = file_name(dent.path()) {
name.to_str().map(|s| s.starts_with(".")).unwrap_or(false)
} else {
false
}
}
/// Returns true if and only if this entry is considered to be hidden.
///
/// This only returns true if the base name of the path starts with a `.`.
#[cfg(not(any(unix, windows)))]
pub fn is_hidden(dent: &DirEntry) -> bool {
if let Some(name) = file_name(dent.path()) {
name.to_str().map(|s| s.starts_with(".")).unwrap_or(false)
} else {
false

View File

@@ -103,12 +103,15 @@ const DEFAULT_TYPES: &'static [(&'static str, &'static [&'static str])] = &[
("amake", &["*.mk", "*.bp"]),
("asciidoc", &["*.adoc", "*.asc", "*.asciidoc"]),
("asm", &["*.asm", "*.s", "*.S"]),
("asp", &["*.aspx", "*.aspx.cs", "*.aspx.cs", "*.ascx", "*.ascx.cs", "*.ascx.vb"]),
("avro", &["*.avdl", "*.avpr", "*.avsc"]),
("awk", &["*.awk"]),
("bazel", &["*.bzl", "WORKSPACE", "BUILD"]),
("bazel", &["*.bzl", "WORKSPACE", "BUILD", "BUILD.bazel"]),
("bitbake", &["*.bb", "*.bbappend", "*.bbclass", "*.conf", "*.inc"]),
("bzip2", &["*.bz2"]),
("c", &["*.c", "*.h", "*.H", "*.cats"]),
("brotli", &["*.br"]),
("buildstream", &["*.bst"]),
("bzip2", &["*.bz2", "*.tbz2"]),
("c", &["*.[chH]", "*.[chH].in", "*.cats"]),
("cabal", &["*.cabal"]),
("cbor", &["*.cbor"]),
("ceylon", &["*.ceylon"]),
@@ -118,8 +121,8 @@ const DEFAULT_TYPES: &'static [(&'static str, &'static [&'static str])] = &[
("creole", &["*.creole"]),
("config", &["*.cfg", "*.conf", "*.config", "*.ini"]),
("cpp", &[
"*.C", "*.cc", "*.cpp", "*.cxx",
"*.h", "*.H", "*.hh", "*.hpp", "*.hxx", "*.inl",
"*.[ChH]", "*.cc", "*.[ch]pp", "*.[ch]xx", "*.hh", "*.inl",
"*.[ChH].in", "*.cc.in", "*.[ch]pp.in", "*.[ch]xx.in", "*.hh.in",
]),
("crystal", &["Projectfile", "*.cr"]),
("cs", &["*.cs"]),
@@ -127,7 +130,7 @@ const DEFAULT_TYPES: &'static [(&'static str, &'static [&'static str])] = &[
("cshtml", &["*.cshtml"]),
("css", &["*.css", "*.scss"]),
("csv", &["*.csv"]),
("cython", &["*.pyx"]),
("cython", &["*.pyx", "*.pxi", "*.pxd"]),
("dart", &["*.dart"]),
("d", &["*.d"]),
("dhall", &["*.dhall"]),
@@ -143,9 +146,10 @@ const DEFAULT_TYPES: &'static [(&'static str, &'static [&'static str])] = &[
"*.f90", "*.F90", "*.f95", "*.F95",
]),
("fsharp", &["*.fs", "*.fsx", "*.fsi"]),
("gap", &["*.g", "*.gap", "*.gi", "*.gd", "*.tst"]),
("gn", &["*.gn", "*.gni"]),
("go", &["*.go"]),
("gzip", &["*.gz"]),
("gzip", &["*.gz", "*.tgz"]),
("groovy", &["*.groovy", "*.gradle"]),
("h", &["*.h", "*.hpp"]),
("hbs", &["*.hbs"]),
@@ -153,7 +157,7 @@ const DEFAULT_TYPES: &'static [(&'static str, &'static [&'static str])] = &[
("hs", &["*.hs", "*.lhs"]),
("html", &["*.htm", "*.html", "*.ejs"]),
("idris", &["*.idr", "*.lidr"]),
("java", &["*.java", "*.jsp"]),
("java", &["*.java", "*.jsp", "*.jspx", "*.properties"]),
("jinja", &["*.j2", "*.jinja", "*.jinja2"]),
("js", &[
"*.js", "*.jsx", "*.vue",
@@ -193,14 +197,16 @@ const DEFAULT_TYPES: &'static [(&'static str, &'static [&'static str])] = &[
"OFL-*[0-9]*",
]),
("lisp", &["*.el", "*.jl", "*.lisp", "*.lsp", "*.sc", "*.scm"]),
("lock", &["*.lock", "package-lock.json"]),
("log", &["*.log"]),
("lua", &["*.lua"]),
("lzma", &["*.lzma"]),
("lz4", &["*.lz4"]),
("m4", &["*.ac", "*.m4"]),
("make", &[
"gnumakefile", "Gnumakefile", "GNUmakefile",
"makefile", "Makefile",
"[Gg][Nn][Uu]makefile", "[Mm]akefile",
"[Gg][Nn][Uu]makefile.am", "[Mm]akefile.am",
"[Gg][Nn][Uu]makefile.in", "[Mm]akefile.in",
"*.mk", "*.mak"
]),
("mako", &["*.mako", "*.mao"]),
@@ -213,22 +219,25 @@ const DEFAULT_TYPES: &'static [(&'static str, &'static [&'static str])] = &[
("msbuild", &[
"*.csproj", "*.fsproj", "*.vcxproj", "*.proj", "*.props", "*.targets"
]),
("nim", &["*.nim"]),
("nim", &["*.nim", "*.nimf", "*.nimble", "*.nims"]),
("nix", &["*.nix"]),
("objc", &["*.h", "*.m"]),
("objcpp", &["*.h", "*.mm"]),
("ocaml", &["*.ml", "*.mli", "*.mll", "*.mly"]),
("org", &["*.org"]),
("pascal", &["*.pas", "*.dpr", "*.lpr", "*.pp", "*.inc"]),
("perl", &["*.perl", "*.pl", "*.PL", "*.plh", "*.plx", "*.pm", "*.t"]),
("pdf", &["*.pdf"]),
("php", &["*.php", "*.php3", "*.php4", "*.php5", "*.phtml"]),
("pod", &["*.pod"]),
("postscript", &[".eps", ".ps"]),
("protobuf", &["*.proto"]),
("ps", &["*.cdxml", "*.ps1", "*.ps1xml", "*.psd1", "*.psm1"]),
("puppet", &["*.erb", "*.pp", "*.rb"]),
("purs", &["*.purs"]),
("py", &["*.py"]),
("qmake", &["*.pro", "*.pri", "*.prf"]),
("qml", &["*.qml"]),
("readme", &["README*", "*README"]),
("r", &["*.R", "*.r", "*.Rmd", "*.Rnw"]),
("rdoc", &["*.rdoc"]),
@@ -277,8 +286,9 @@ const DEFAULT_TYPES: &'static [(&'static str, &'static [&'static str])] = &[
]),
("taskpaper", &["*.taskpaper"]),
("tcl", &["*.tcl"]),
("tex", &["*.tex", "*.ltx", "*.cls", "*.sty", "*.bib"]),
("tex", &["*.tex", "*.ltx", "*.cls", "*.sty", "*.bib", "*.dtx", "*.ins"]),
("textile", &["*.textile"]),
("thrift", &["*.thrift"]),
("tf", &["*.tf"]),
("ts", &["*.ts", "*.tsx"]),
("txt", &["*.txt"]),
@@ -292,10 +302,14 @@ const DEFAULT_TYPES: &'static [(&'static str, &'static [&'static str])] = &[
("vimscript", &["*.vim"]),
("wiki", &["*.mediawiki", "*.wiki"]),
("webidl", &["*.idl", "*.webidl", "*.widl"]),
("xml", &["*.xml", "*.xml.dist"]),
("xz", &["*.xz"]),
("xml", &[
"*.xml", "*.xml.dist", "*.dtd", "*.xsl", "*.xslt", "*.xsd", "*.xjb",
"*.rng", "*.sch",
]),
("xz", &["*.xz", "*.txz"]),
("yacc", &["*.y"]),
("yaml", &["*.yaml", "*.yml"]),
("zig", &["*.zig"]),
("zsh", &[
".zshenv", "zshenv",
".zlogin", "zlogin",
@@ -304,6 +318,7 @@ const DEFAULT_TYPES: &'static [(&'static str, &'static [&'static str])] = &[
".zshrc", "zshrc",
"*.zsh",
]),
("zstd", &["*.zst", "*.zstd"]),
];
/// Glob represents a single glob in a set of file type definitions.
@@ -342,6 +357,18 @@ impl<'a> Glob<'a> {
fn unmatched() -> Glob<'a> {
Glob(GlobInner::UnmatchedIgnore)
}
/// Return the file type defintion that matched, if one exists. A file type
/// definition always exists when a specific definition matches a file
/// path.
pub fn file_type_def(&self) -> Option<&FileTypeDef> {
match self {
Glob(GlobInner::UnmatchedIgnore) => None,
Glob(GlobInner::Matched { def, .. }) => {
Some(def)
},
}
}
}
/// A single file type definition.

File diff suppressed because it is too large Load Diff

View File

@@ -1,14 +1,14 @@
class RipgrepBin < Formula
version '0.9.0'
version '11.0.1'
desc "Recursively search directories for a regex pattern."
homepage "https://github.com/BurntSushi/ripgrep"
if OS.mac?
url "https://github.com/BurntSushi/ripgrep/releases/download/#{version}/ripgrep-#{version}-x86_64-apple-darwin.tar.gz"
sha256 "36003ea8b62ad6274dc14140039f448cdf5026827d53cf24dad2d84005557a8c"
sha256 "92446c6b28b6c726f91ad66a03bcd533fc6e1a28ef4b44c27bfe2d49a0f88531"
elsif OS.linux?
url "https://github.com/BurntSushi/ripgrep/releases/download/#{version}/ripgrep-#{version}-x86_64-unknown-linux-musl.tar.gz"
sha256 "2eb4443e58f95051ff76ea036ed1faf940d5a04af4e7ff5a7dbd74576b907e99"
sha256 "ce74cabac9b39b1ad55837ec01e2c670fa7e965772ac2647b209e31ead19008c"
end
conflicts_with "ripgrep"

1
rustfmt.toml Normal file
View File

@@ -0,0 +1 @@
disable_all_formatting = true

33
scripts/copy-examples Executable file
View File

@@ -0,0 +1,33 @@
#!/usr/bin/env python
from __future__ import absolute_import, division, print_function
import argparse
import codecs
import os.path
import re
RE_EACH_CODE_BLOCK = re.compile(
r'(?s)(?:```|\{\{< high rust[^>]+>\}\})[^\n]*\n(.*?)(?:```|\{\{< /high >\}\})' # noqa
)
RE_MARKER = re.compile(r'^(?:# )?//([^/].*)$')
RE_STRIP_COMMENT = re.compile(r'^# ?')
if __name__ == '__main__':
p = argparse.ArgumentParser()
p.add_argument('--rust-file', default='src/cookbook.rs')
p.add_argument('--example-dir', default='grep/examples')
args = p.parse_args()
with codecs.open(args.rust_file, encoding='utf-8') as f:
rustcode = f.read()
for m in RE_EACH_CODE_BLOCK.finditer(rustcode):
lines = m.group(1).splitlines()
marker, codelines = lines[0], lines[1:]
m = RE_MARKER.search(marker)
if m is None:
continue
code = '\n'.join(RE_STRIP_COMMENT.sub('', line) for line in codelines)
fpath = os.path.join(args.example_dir, m.group(1))
with codecs.open(fpath, mode='w+', encoding='utf-8') as f:
print(code, file=f)

View File

@@ -9,22 +9,27 @@
// is where we read clap's configuration from the end user's arguments and turn
// it into a ripgrep-specific configuration type that is not coupled with clap.
use clap::{self, App, AppSettings};
use clap::{self, App, AppSettings, crate_authors, crate_version};
use lazy_static::lazy_static;
const ABOUT: &str = "
ripgrep (rg) recursively searches your current directory for a regex pattern.
By default, ripgrep will respect your `.gitignore` and automatically skip
hidden files/directories and binary files.
By default, ripgrep will respect your .gitignore and automatically skip hidden
files/directories and binary files.
ripgrep's regex engine uses finite automata and guarantees linear time
searching. Because of this, features like backreferences and arbitrary
lookaround are not supported.
ripgrep's default regex engine uses finite automata and guarantees linear
time searching. Because of this, features like backreferences and arbitrary
look-around are not supported. However, if ripgrep is built with PCRE2, then
the --pcre2 flag can be used to enable backreferences and look-around.
ripgrep supports configuration files. Set RIPGREP_CONFIG_PATH to a
configuration file. The file can specify one shell argument per line. Lines
starting with '#' are ignored. For more details, see the man page or the
README.
Tip: to disable all smart filtering and make ripgrep behave a bit more like
classical grep, use 'rg -uuu'.
Project home page: https://github.com/BurntSushi/ripgrep
Use -h for short descriptions and --help for more details.";
@@ -79,7 +84,7 @@ pub fn app() -> App<'static, 'static> {
/// Return the "long" format of ripgrep's version string.
///
/// If a revision hash is given, then it is used. If one isn't given, then
/// the RIPGREP_BUILD_GIT_HASH env var is inspect for it. If that isn't set,
/// the RIPGREP_BUILD_GIT_HASH env var is inspected for it. If that isn't set,
/// then a revision hash is not included in the version string returned.
pub fn long_version(revision_hash: Option<&str>) -> String {
// Do we have a git hash?
@@ -125,7 +130,7 @@ fn compile_cpu_features() -> Vec<&'static str> {
}
/// Returns the relevant CPU features enabled at runtime.
#[cfg(all(ripgrep_runtime_cpu, target_arch = "x86_64"))]
#[cfg(target_arch = "x86_64")]
fn runtime_cpu_features() -> Vec<&'static str> {
// This is kind of a dirty violation of abstraction, since it assumes
// knowledge about what specific SIMD features are being used.
@@ -145,7 +150,7 @@ fn runtime_cpu_features() -> Vec<&'static str> {
}
/// Returns the relevant CPU features enabled at runtime.
#[cfg(not(all(ripgrep_runtime_cpu, target_arch = "x86_64")))]
#[cfg(not(target_arch = "x86_64"))]
fn runtime_cpu_features() -> Vec<&'static str> {
vec![]
}
@@ -536,9 +541,16 @@ pub fn all_args_and_flags() -> Vec<RGArg> {
// The positional arguments must be defined first and in order.
arg_pattern(&mut args);
arg_path(&mut args);
// Flags can be defined in any order, but we do it alphabetically.
// Flags can be defined in any order, but we do it alphabetically. Note
// that each function may define multiple flags. For example,
// `flag_encoding` defines `--encoding` and `--no-encoding`. Most `--no`
// flags are hidden and merely mentioned in the docs of the corresponding
// "positive" flag.
flag_after_context(&mut args);
flag_auto_hybrid_regex(&mut args);
flag_before_context(&mut args);
flag_binary(&mut args);
flag_block_buffered(&mut args);
flag_byte_offset(&mut args);
flag_case_sensitive(&mut args);
flag_color(&mut args);
@@ -564,11 +576,14 @@ pub fn all_args_and_flags() -> Vec<RGArg> {
flag_iglob(&mut args);
flag_ignore_case(&mut args);
flag_ignore_file(&mut args);
flag_ignore_file_case_insensitive(&mut args);
flag_invert_match(&mut args);
flag_json(&mut args);
flag_line_buffered(&mut args);
flag_line_number(&mut args);
flag_line_regexp(&mut args);
flag_max_columns(&mut args);
flag_max_columns_preview(&mut args);
flag_max_count(&mut args);
flag_max_depth(&mut args);
flag_max_filesize(&mut args);
@@ -577,19 +592,23 @@ pub fn all_args_and_flags() -> Vec<RGArg> {
flag_multiline_dotall(&mut args);
flag_no_config(&mut args);
flag_no_ignore(&mut args);
flag_no_ignore_dot(&mut args);
flag_no_ignore_global(&mut args);
flag_no_ignore_messages(&mut args);
flag_no_ignore_parent(&mut args);
flag_no_ignore_vcs(&mut args);
flag_no_messages(&mut args);
flag_no_pcre2_unicode(&mut args);
flag_null(&mut args);
flag_null_data(&mut args);
flag_one_file_system(&mut args);
flag_only_matching(&mut args);
flag_path_separator(&mut args);
flag_passthru(&mut args);
flag_pcre2(&mut args);
flag_pcre2_unicode(&mut args);
flag_pcre2_version(&mut args);
flag_pre(&mut args);
flag_pre_glob(&mut args);
flag_pretty(&mut args);
flag_quiet(&mut args);
flag_regex_size_limit(&mut args);
@@ -598,6 +617,8 @@ pub fn all_args_and_flags() -> Vec<RGArg> {
flag_search_zip(&mut args);
flag_smart_case(&mut args);
flag_sort_files(&mut args);
flag_sort(&mut args);
flag_sortr(&mut args);
flag_stats(&mut args);
flag_text(&mut args);
flag_threads(&mut args);
@@ -632,7 +653,7 @@ will be provided. Namely, the following is equivalent to the above:
let arg = RGArg::positional("pattern", "PATTERN")
.help(SHORT).long_help(LONG)
.required_unless(&[
"file", "files", "regexp", "type-list",
"file", "files", "regexp", "type-list", "pcre2-version",
]);
args.push(arg);
}
@@ -663,6 +684,50 @@ This overrides the --context flag.
args.push(arg);
}
fn flag_auto_hybrid_regex(args: &mut Vec<RGArg>) {
const SHORT: &str = "Dynamically use PCRE2 if necessary.";
const LONG: &str = long!("\
When this flag is used, ripgrep will dynamically choose between supported regex
engines depending on the features used in a pattern. When ripgrep chooses a
regex engine, it applies that choice for every regex provided to ripgrep (e.g.,
via multiple -e/--regexp or -f/--file flags).
As an example of how this flag might behave, ripgrep will attempt to use
its default finite automata based regex engine whenever the pattern can be
successfully compiled with that regex engine. If PCRE2 is enabled and if the
pattern given could not be compiled with the default regex engine, then PCRE2
will be automatically used for searching. If PCRE2 isn't available, then this
flag has no effect because there is only one regex engine to choose from.
In the future, ripgrep may adjust its heuristics for how it decides which
regex engine to use. In general, the heuristics will be limited to a static
analysis of the patterns, and not to any specific runtime behavior observed
while searching files.
The primary downside of using this flag is that it may not always be obvious
which regex engine ripgrep uses, and thus, the match semantics or performance
profile of ripgrep may subtly and unexpectedly change. However, in many cases,
all regex engines will agree on what constitutes a match and it can be nice
to transparently support more advanced regex features like look-around and
backreferences without explicitly needing to enable them.
This flag can be disabled with --no-auto-hybrid-regex.
");
let arg = RGArg::switch("auto-hybrid-regex")
.help(SHORT).long_help(LONG)
.overrides("no-auto-hybrid-regex")
.overrides("pcre2")
.overrides("no-pcre2");
args.push(arg);
let arg = RGArg::switch("no-auto-hybrid-regex")
.hidden()
.overrides("auto-hybrid-regex")
.overrides("pcre2")
.overrides("no-pcre2");
args.push(arg);
}
fn flag_before_context(args: &mut Vec<RGArg>) {
const SHORT: &str = "Show NUM lines before each match.";
const LONG: &str = long!("\
@@ -677,13 +742,98 @@ This overrides the --context flag.
args.push(arg);
}
fn flag_binary(args: &mut Vec<RGArg>) {
const SHORT: &str = "Search binary files.";
const LONG: &str = long!("\
Enabling this flag will cause ripgrep to search binary files. By default,
ripgrep attempts to automatically skip binary files in order to improve the
relevance of results and make the search faster.
Binary files are heuristically detected based on whether they contain a NUL
byte or not. By default (without this flag set), once a NUL byte is seen,
ripgrep will stop searching the file. Usually, NUL bytes occur in the beginning
of most binary files. If a NUL byte occurs after a match, then ripgrep will
still stop searching the rest of the file, but a warning will be printed.
In contrast, when this flag is provided, ripgrep will continue searching a file
even if a NUL byte is found. In particular, if a NUL byte is found then ripgrep
will continue searching until either a match is found or the end of the file is
reached, whichever comes sooner. If a match is found, then ripgrep will stop
and print a warning saying that the search stopped prematurely.
If you want ripgrep to search a file without any special NUL byte handling at
all (and potentially print binary data to stdout), then you should use the
'-a/--text' flag.
The '--binary' flag is a flag for controlling ripgrep's automatic filtering
mechanism. As such, it does not need to be used when searching a file
explicitly or when searching stdin. That is, it is only applicable when
recursively searching a directory.
Note that when the '-u/--unrestricted' flag is provided for a third time, then
this flag is automatically enabled.
This flag can be disabled with '--no-binary'. It overrides the '-a/--text'
flag.
");
let arg = RGArg::switch("binary")
.help(SHORT).long_help(LONG)
.overrides("no-binary")
.overrides("text")
.overrides("no-text");
args.push(arg);
let arg = RGArg::switch("no-binary")
.hidden()
.overrides("binary")
.overrides("text")
.overrides("no-text");
args.push(arg);
}
fn flag_block_buffered(args: &mut Vec<RGArg>) {
const SHORT: &str = "Force block buffering.";
const LONG: &str = long!("\
When enabled, ripgrep will use block buffering. That is, whenever a matching
line is found, it will be written to an in-memory buffer and will not be
written to stdout until the buffer reaches a certain size. This is the default
when ripgrep's stdout is redirected to a pipeline or a file. When ripgrep's
stdout is connected to a terminal, line buffering will be used. Forcing block
buffering can be useful when dumping a large amount of contents to a terminal.
Forceful block buffering can be disabled with --no-block-buffered. Note that
using --no-block-buffered causes ripgrep to revert to its default behavior of
automatically detecting the buffering strategy. To force line buffering, use
the --line-buffered flag.
");
let arg = RGArg::switch("block-buffered")
.help(SHORT).long_help(LONG)
.overrides("no-block-buffered")
.overrides("line-buffered")
.overrides("no-line-buffered");
args.push(arg);
let arg = RGArg::switch("no-block-buffered")
.hidden()
.overrides("block-buffered")
.overrides("line-buffered")
.overrides("no-line-buffered");
args.push(arg);
}
fn flag_byte_offset(args: &mut Vec<RGArg>) {
const SHORT: &str =
"Print the 0-based byte offset for each matching line.";
const LONG: &str = long!("\
Print the 0-based byte offset within the input file
before each line of output. If -o (--only-matching) is
specified, print the offset of the matching part itself.
Print the 0-based byte offset within the input file before each line of output.
If -o (--only-matching) is specified, print the offset of the matching part
itself.
If ripgrep does transcoding, then the byte offset is in terms of the the result
of transcoding and not the original data. This applies similarly to another
transformation on the source, such as decompression or a --pre filter. Note
that when the PCRE2 regex engine is used, then UTF-8 transcoding is done by
default.
");
let arg = RGArg::switch("byte-offset").short("b")
.help(SHORT).long_help(LONG);
@@ -741,17 +891,17 @@ to one of eight choices: red, blue, green, cyan, magenta, yellow, white and
black. Styles are limited to nobold, bold, nointense, intense, nounderline
or underline.
The format of the flag is `{type}:{attribute}:{value}`. `{type}` should be
one of path, line, column or match. `{attribute}` can be fg, bg or style.
`{value}` is either a color (for fg and bg) or a text style. A special format,
`{type}:none`, will clear all color settings for `{type}`.
The format of the flag is '{type}:{attribute}:{value}'. '{type}' should be
one of path, line, column or match. '{attribute}' can be fg, bg or style.
'{value}' is either a color (for fg and bg) or a text style. A special format,
'{type}:none', will clear all color settings for '{type}'.
For example, the following command will change the match color to magenta and
the background color for line numbers to yellow:
rg --colors 'match:fg:magenta' --colors 'line:bg:yellow' foo.
Extended colors can be used for `{value}` when the terminal supports ANSI color
Extended colors can be used for '{value}' when the terminal supports ANSI color
sequences. These are specified as either 'x' (256-color) or 'x,x,x' (24-bit
truecolor) where x is a number between 0 and 255 inclusive. x may be given as
a normal decimal number or a hexadecimal number, which is prefixed by `0x`.
@@ -932,10 +1082,17 @@ fn flag_encoding(args: &mut Vec<RGArg>) {
const LONG: &str = long!("\
Specify the text encoding that ripgrep will use on all files searched. The
default value is 'auto', which will cause ripgrep to do a best effort automatic
detection of encoding on a per-file basis. Other supported values can be found
in the list of labels here:
detection of encoding on a per-file basis. Automatic detection in this case
only applies to files that begin with a UTF-8 or UTF-16 byte-order mark (BOM).
No other automatic detection is performed. One can also specify 'none' which
will then completely disable BOM sniffing and always result in searching the
raw bytes, including a BOM if it's present, regardless of its encoding.
Other supported values can be found in the list of labels here:
https://encoding.spec.whatwg.org/#concept-encoding-get
For more details on encoding and how ripgrep deals with it, see GUIDE.md.
This flag can be disabled with --no-encoding.
");
let arg = RGArg::flag("encoding", "ENCODING").short("E")
@@ -969,7 +1126,7 @@ fn flag_files(args: &mut Vec<RGArg>) {
const SHORT: &str = "Print each file that would be searched.";
const LONG: &str = long!("\
Print each file that would be searched without actually performing the search.
This is useful to determine whether a particular file is being search or not.
This is useful to determine whether a particular file is being searched or not.
");
let arg = RGArg::switch("files")
.help(SHORT).long_help(LONG)
@@ -1161,6 +1318,26 @@ directly on the command line, then used -g instead.
args.push(arg);
}
fn flag_ignore_file_case_insensitive(args: &mut Vec<RGArg>) {
const SHORT: &str = "Process ignore files case insensitively.";
const LONG: &str = long!("\
Process ignore files (.gitignore, .ignore, etc.) case insensitively. Note that
this comes with a performance penalty and is most useful on case insensitive
file systems (such as Windows).
This flag can be disabled with the --no-ignore-file-case-insensitive flag.
");
let arg = RGArg::switch("ignore-file-case-insensitive")
.help(SHORT).long_help(LONG)
.overrides("no-ignore-file-case-insensitive");
args.push(arg);
let arg = RGArg::switch("no-ignore-file-case-insensitive")
.hidden()
.overrides("ignore-file-case-insensitive");
args.push(arg);
}
fn flag_invert_match(args: &mut Vec<RGArg>) {
const SHORT: &str = "Invert matching.";
const LONG: &str = long!("\
@@ -1231,6 +1408,37 @@ The JSON Lines format can be disabled with --no-json.
args.push(arg);
}
fn flag_line_buffered(args: &mut Vec<RGArg>) {
const SHORT: &str = "Force line buffering.";
const LONG: &str = long!("\
When enabled, ripgrep will use line buffering. That is, whenever a matching
line is found, it will be flushed to stdout immediately. This is the default
when ripgrep's stdout is connected to a terminal, but otherwise, ripgrep will
use block buffering, which is typically faster. This flag forces ripgrep to
use line buffering even if it would otherwise use block buffering. This is
typically useful in shell pipelines, e.g.,
'tail -f something.log | rg foo --line-buffered | rg bar'.
Forceful line buffering can be disabled with --no-line-buffered. Note that
using --no-line-buffered causes ripgrep to revert to its default behavior of
automatically detecting the buffering strategy. To force block buffering, use
the --block-buffered flag.
");
let arg = RGArg::switch("line-buffered")
.help(SHORT).long_help(LONG)
.overrides("no-line-buffered")
.overrides("block-buffered")
.overrides("no-block-buffered");
args.push(arg);
let arg = RGArg::switch("no-line-buffered")
.hidden()
.overrides("line-buffered")
.overrides("block-buffered")
.overrides("no-block-buffered");
args.push(arg);
}
fn flag_line_number(args: &mut Vec<RGArg>) {
const SHORT: &str = "Show line numbers.";
const LONG: &str = long!("\
@@ -1282,6 +1490,30 @@ When this flag is omitted or is set to 0, then it has no effect.
args.push(arg);
}
fn flag_max_columns_preview(args: &mut Vec<RGArg>) {
const SHORT: &str = "Print a preview for lines exceeding the limit.";
const LONG: &str = long!("\
When the '--max-columns' flag is used, ripgrep will by default completely
replace any line that is too long with a message indicating that a matching
line was removed. When this flag is combined with '--max-columns', a preview
of the line (corresponding to the limit size) is shown instead, where the part
of the line exceeding the limit is not shown.
If the '--max-columns' flag is not set, then this has no effect.
This flag can be disabled with '--no-max-columns-preview'.
");
let arg = RGArg::switch("max-columns-preview")
.help(SHORT).long_help(LONG)
.overrides("no-max-columns-preview");
args.push(arg);
let arg = RGArg::switch("no-max-columns-preview")
.hidden()
.overrides("max-columns-preview");
args.push(arg);
}
fn flag_max_count(args: &mut Vec<RGArg>) {
const SHORT: &str = "Limit the number of matches.";
const LONG: &str = long!("\
@@ -1457,7 +1689,7 @@ fn flag_no_ignore(args: &mut Vec<RGArg>) {
const SHORT: &str = "Don't respect ignore files.";
const LONG: &str = long!("\
Don't respect ignore files (.gitignore, .ignore, etc.). This implies
--no-ignore-parent and --no-ignore-vcs.
--no-ignore-parent, --no-ignore-dot and --no-ignore-vcs.
This flag can be disabled with the --ignore flag.
");
@@ -1472,6 +1704,24 @@ This flag can be disabled with the --ignore flag.
args.push(arg);
}
fn flag_no_ignore_dot(args: &mut Vec<RGArg>) {
const SHORT: &str = "Don't respect .ignore files.";
const LONG: &str = long!("\
Don't respect .ignore files.
This flag can be disabled with the --ignore-dot flag.
");
let arg = RGArg::switch("no-ignore-dot")
.help(SHORT).long_help(LONG)
.overrides("ignore-dot");
args.push(arg);
let arg = RGArg::switch("ignore-dot")
.hidden()
.overrides("no-ignore-dot");
args.push(arg);
}
fn flag_no_ignore_global(args: &mut Vec<RGArg>) {
const SHORT: &str = "Don't respect global ignore files.";
const LONG: &str = long!("\
@@ -1568,6 +1818,48 @@ This flag can be disabled with the --messages flag.
args.push(arg);
}
fn flag_no_pcre2_unicode(args: &mut Vec<RGArg>) {
const SHORT: &str = "Disable Unicode mode for PCRE2 matching.";
const LONG: &str = long!("\
When PCRE2 matching is enabled, this flag will disable Unicode mode, which is
otherwise enabled by default. If PCRE2 matching is not enabled, then this flag
has no effect.
When PCRE2's Unicode mode is enabled, several different types of patterns
become Unicode aware. This includes '\\b', '\\B', '\\w', '\\W', '\\d', '\\D',
'\\s' and '\\S'. Similarly, the '.' meta character will match any Unicode
codepoint instead of any byte. Caseless matching will also use Unicode simple
case folding instead of ASCII-only case insensitivity.
Unicode mode in PCRE2 represents a critical trade off in the user experience
of ripgrep. In particular, unlike the default regex engine, PCRE2 does not
support the ability to search possibly invalid UTF-8 with Unicode features
enabled. Instead, PCRE2 *requires* that everything it searches when Unicode
mode is enabled is valid UTF-8. (Or valid UTF-16/UTF-32, but for the purposes
of ripgrep, we only discuss UTF-8.) This means that if you have PCRE2's Unicode
mode enabled and you attempt to search invalid UTF-8, then the search for that
file will halt and print an error. For this reason, when PCRE2's Unicode mode
is enabled, ripgrep will automatically \"fix\" invalid UTF-8 sequences by
replacing them with the Unicode replacement codepoint.
If you would rather see the encoding errors surfaced by PCRE2 when Unicode mode
is enabled, then pass the --no-encoding flag to disable all transcoding.
Related flags: --pcre2
This flag can be disabled with --pcre2-unicode.
");
let arg = RGArg::switch("no-pcre2-unicode")
.help(SHORT).long_help(LONG)
.overrides("pcre2-unicode");
args.push(arg);
let arg = RGArg::switch("pcre2-unicode")
.hidden()
.overrides("no-pcre2-unicode");
args.push(arg);
}
fn flag_null(args: &mut Vec<RGArg>) {
const SHORT: &str = "Print a NUL byte after file paths.";
const LONG: &str = long!("\
@@ -1604,6 +1896,33 @@ Using this flag implies -a/--text.
args.push(arg);
}
fn flag_one_file_system(args: &mut Vec<RGArg>) {
const SHORT: &str =
"Do not descend into directories on other file systems.";
const LONG: &str = long!("\
When enabled, ripgrep will not cross file system boundaries relative to where
the search started from.
Note that this applies to each path argument given to ripgrep. For example, in
the command 'rg --one-file-system /foo/bar /quux/baz', ripgrep will search both
'/foo/bar' and '/quux/baz' even if they are on different file systems, but will
not cross a file system boundary when traversing each path's directory tree.
This is similar to find's '-xdev' or '-mount' flag.
This flag can be disabled with --no-one-file-system.
");
let arg = RGArg::switch("one-file-system")
.help(SHORT).long_help(LONG)
.overrides("no-one-file-system");
args.push(arg);
let arg = RGArg::switch("no-one-file-system")
.hidden()
.overrides("one-file-system");
args.push(arg);
}
fn flag_only_matching(args: &mut Vec<RGArg>) {
const SHORT: &str = "Print only matches parts of a line.";
const LONG: &str = long!("\
@@ -1658,56 +1977,126 @@ Note that PCRE2 is an optional ripgrep feature. If PCRE2 wasn't included in
your build of ripgrep, then using this flag will result in ripgrep printing
an error message and exiting.
Related flags: --no-pcre2-unicode
This flag can be disabled with --no-pcre2.
");
let arg = RGArg::switch("pcre2").short("P")
.help(SHORT).long_help(LONG)
.overrides("no-pcre2");
.overrides("no-pcre2")
.overrides("auto-hybrid-regex")
.overrides("no-auto-hybrid-regex");
args.push(arg);
let arg = RGArg::switch("no-pcre2")
.hidden()
.overrides("pcre2");
.overrides("pcre2")
.overrides("auto-hybrid-regex")
.overrides("no-auto-hybrid-regex");
args.push(arg);
}
fn flag_pcre2_unicode(args: &mut Vec<RGArg>) {
const SHORT: &str = "Enable Unicode mode for PCRE2 matching.";
fn flag_pcre2_version(args: &mut Vec<RGArg>) {
const SHORT: &str = "Print the version of PCRE2 that ripgrep uses.";
const LONG: &str = long!("\
When PCRE2 matching is enabled, this flag will enable Unicode mode. If PCRE2
matching is not enabled, then this flag has no effect.
This flag is enabled by default when PCRE2 matching is enabled.
When PCRE2's Unicode mode is enabled several different types of patterns become
Unicode aware. This includes '\\b', '\\B', '\\w', '\\W', '\\d', '\\D', '\\s'
and '\\S'. Similarly, the '.' meta character will match any Unicode codepoint
instead of any byte. Caseless matching will also use Unicode simple case
folding instead of ASCII-only case insensitivity.
Unicode mode in PCRE2 represents a critical trade off in the user experience
of ripgrep. In particular, unlike the default regex engine, PCRE2 does not
support the ability to search possibly invalid UTF-8 with Unicode features
enabled. Instead, PCRE2 *requires* that everything it searches when Unicode
mode is enabled is valid UTF-8. (Or valid UTF-16/UTF-32, but for the purposes
of ripgrep, we only discuss UTF-8.) This means that if you have PCRE2's Unicode
mode enabled and you attempt to search invalid UTF-8, then the search for that
file will halt and print an error. For this reason, when PCRE2's Unicode mode
is enabled, ripgrep will automatically \"fix\" invalid UTF-8 sequences by
replacing them with the Unicode replacement codepoint.
If you would rather see the encoding errors surfaced by PCRE2 when Unicode mode
is enabled, then pass the --no-encoding flag to disable all transcoding.
This flag can be disabled with --no-pcre2-unicode.
When this flag is present, ripgrep will print the version of PCRE2 in use,
along with other information, and then exit. If PCRE2 is not available, then
ripgrep will print an error message and exit with an error code.
");
let arg = RGArg::switch("pcre2-unicode")
let arg = RGArg::switch("pcre2-version")
.help(SHORT).long_help(LONG);
args.push(arg);
}
let arg = RGArg::switch("no-pcre2-unicode")
fn flag_pre(args: &mut Vec<RGArg>) {
const SHORT: &str = "search outputs of COMMAND FILE for each FILE";
const LONG: &str = long!("\
For each input FILE, search the standard output of COMMAND FILE rather than the
contents of FILE. This option expects the COMMAND program to either be an
absolute path or to be available in your PATH. Either an empty string COMMAND
or the '--no-pre' flag will disable this behavior.
WARNING: When this flag is set, ripgrep will unconditionally spawn a
process for every file that is searched. Therefore, this can incur an
unnecessarily large performance penalty if you don't otherwise need the
flexibility offered by this flag. One possible mitigation to this is to use
the '--pre-glob' flag to limit which files a preprocessor is run with.
A preprocessor is not run when ripgrep is searching stdin.
When searching over sets of files that may require one of several decoders
as preprocessors, COMMAND should be a wrapper program or script which first
classifies FILE based on magic numbers/content or based on the FILE name and
then dispatches to an appropriate preprocessor. Each COMMAND also has its
standard input connected to FILE for convenience.
For example, a shell script for COMMAND might look like:
case \"$1\" in
*.pdf)
exec pdftotext \"$1\" -
;;
*)
case $(file \"$1\") in
*Zstandard*)
exec pzstd -cdq
;;
*)
exec cat
;;
esac
;;
esac
The above script uses `pdftotext` to convert a PDF file to plain text. For
all other files, the script uses the `file` utility to sniff the type of the
file based on its contents. If it is a compressed file in the Zstandard format,
then `pzstd` is used to decompress the contents to stdout.
This overrides the -z/--search-zip flag.
");
let arg = RGArg::flag("pre", "COMMAND")
.help(SHORT).long_help(LONG)
.overrides("no-pre")
.overrides("search-zip");
args.push(arg);
let arg = RGArg::switch("no-pre")
.hidden()
.overrides("pcre2-unicode");
.overrides("pre");
args.push(arg);
}
fn flag_pre_glob(args: &mut Vec<RGArg>) {
const SHORT: &str =
"Include or exclude files from a preprocessing command.";
const LONG: &str = long!("\
This flag works in conjunction with the --pre flag. Namely, when one or more
--pre-glob flags are given, then only files that match the given set of globs
will be handed to the command specified by the --pre flag. Any non-matching
files will be searched without using the preprocessor command.
This flag is useful when searching many files with the --pre flag. Namely,
it permits the ability to avoid process overhead for files that don't need
preprocessing. For example, given the following shell script, 'pre-pdftotext':
#!/bin/sh
pdftotext \"$1\" -
then it is possible to use '--pre pre-pdftotext --pre-glob \'*.pdf\'' to make
it so ripgrep only executes the 'pre-pdftotext' command on files with a '.pdf'
extension.
Multiple --pre-glob flags may be used. Globbing rules match .gitignore globs.
Precede a glob with a ! to exclude it.
This flag has no effect if the --pre flag is not used.
");
let arg = RGArg::flag("pre-glob", "GLOB")
.help(SHORT).long_help(LONG)
.multiple()
.allow_leading_hyphen();
args.push(arg);
}
@@ -1798,9 +2187,9 @@ This flag can be used with the -o/--only-matching flag.
fn flag_search_zip(args: &mut Vec<RGArg>) {
const SHORT: &str = "Search in compressed files.";
const LONG: &str = long!("\
Search in compressed files. Currently gz, bz2, xz, lzma and lz4 files are
supported. This option expects the decompression binaries to be available in
your PATH.
Search in compressed files. Currently gzip, bzip2, xz, LZ4, LZMA, Brotli and
Zstd files are supported. This option expects the decompression binaries to be
available in your PATH.
This flag can be disabled with --no-search-zip.
");
@@ -1816,64 +2205,6 @@ This flag can be disabled with --no-search-zip.
args.push(arg);
}
fn flag_pre(args: &mut Vec<RGArg>) {
const SHORT: &str = "search outputs of COMMAND FILE for each FILE";
const LONG: &str = long!("\
For each input FILE, search the standard output of COMMAND FILE rather than the
contents of FILE. This option expects the COMMAND program to either be an
absolute path or to be available in your PATH. Either an empty string COMMAND
or the `--no-pre` flag will disable this behavior.
WARNING: When this flag is set, ripgrep will unconditionally spawn a
process for every file that is searched. Therefore, this can incur an
unnecessarily large performance penalty if you don't otherwise need the
flexibility offered by this flag.
A preprocessor is not run when ripgrep is searching stdin.
When searching over sets of files that may require one of several decoders
as preprocessors, COMMAND should be a wrapper program or script which first
classifies FILE based on magic numbers/content or based on the FILE name and
then dispatches to an appropriate preprocessor. Each COMMAND also has its
standard input connected to FILE for convenience.
For example, a shell script for COMMAND might look like:
case \"$1\" in
*.pdf)
exec pdftotext \"$1\" -
;;
*)
case $(file \"$1\") in
*Zstandard*)
exec pzstd -cdq
;;
*)
exec cat
;;
esac
;;
esac
The above script uses `pdftotext` to convert a PDF file to plain text. For
all other files, the script uses the `file` utility to sniff the type of the
file based on its contents. If it is a compressed file in the Zstandard format,
then `pzstd` is used to decompress the contents to stdout.
This overrides the -z/--search-zip flag.
");
let arg = RGArg::flag("pre", "COMMAND")
.help(SHORT).long_help(LONG)
.overrides("no-pre")
.overrides("search-zip");
args.push(arg);
let arg = RGArg::switch("no-pre")
.hidden()
.overrides("pre");
args.push(arg);
}
fn flag_smart_case(args: &mut Vec<RGArg>) {
const SHORT: &str = "Smart case search.";
const LONG: &str = long!("\
@@ -1890,8 +2221,10 @@ This overrides the -s/--case-sensitive and -i/--ignore-case flags.
}
fn flag_sort_files(args: &mut Vec<RGArg>) {
const SHORT: &str = "Sort results by file path. Implies --threads=1.";
const SHORT: &str = "DEPRECATED";
const LONG: &str = long!("\
DEPRECATED: Use --sort or --sortr instead.
Sort results by file path. Note that this currently disables all parallelism
and runs search in a single thread.
@@ -1899,12 +2232,83 @@ This flag can be disabled with --no-sort-files.
");
let arg = RGArg::switch("sort-files")
.help(SHORT).long_help(LONG)
.overrides("no-sort-files");
.hidden()
.overrides("no-sort-files")
.overrides("sort")
.overrides("sortr");
args.push(arg);
let arg = RGArg::switch("no-sort-files")
.hidden()
.overrides("sort-files");
.overrides("sort-files")
.overrides("sort")
.overrides("sortr");
args.push(arg);
}
fn flag_sort(args: &mut Vec<RGArg>) {
const SHORT: &str =
"Sort results in ascending order. Implies --threads=1.";
const LONG: &str = long!("\
This flag enables sorting of results in ascending order. The possible values
for this flag are:
path Sort by file path.
modified Sort by the last modified time on a file.
accessed Sort by the last accessed time on a file.
created Sort by the creation time on a file.
none Do not sort results.
If the sorting criteria isn't available on your system (for example, creation
time is not available on ext4 file systems), then ripgrep will attempt to
detect this and print an error without searching any results. Otherwise, the
sort order is unspecified.
To sort results in reverse or descending order, use the --sortr flag. Also,
this flag overrides --sortr.
Note that sorting results currently always forces ripgrep to abandon
parallelism and run in a single thread.
");
let arg = RGArg::flag("sort", "SORTBY")
.help(SHORT).long_help(LONG)
.possible_values(&["path", "modified", "accessed", "created", "none"])
.overrides("sortr")
.overrides("sort-files")
.overrides("no-sort-files");
args.push(arg);
}
fn flag_sortr(args: &mut Vec<RGArg>) {
const SHORT: &str =
"Sort results in descending order. Implies --threads=1.";
const LONG: &str = long!("\
This flag enables sorting of results in descending order. The possible values
for this flag are:
path Sort by file path.
modified Sort by the last modified time on a file.
accessed Sort by the last accessed time on a file.
created Sort by the creation time on a file.
none Do not sort results.
If the sorting criteria isn't available on your system (for example, creation
time is not available on ext4 file systems), then ripgrep will attempt to
detect this and print an error without searching any results. Otherwise, the
sort order is unspecified.
To sort results in ascending order, use the --sort flag. Also, this flag
overrides --sort.
Note that sorting results currently always forces ripgrep to abandon
parallelism and run in a single thread.
");
let arg = RGArg::flag("sortr", "SORTBY")
.help(SHORT).long_help(LONG)
.possible_values(&["path", "modified", "accessed", "created", "none"])
.overrides("sort")
.overrides("sort-files")
.overrides("no-sort-files");
args.push(arg);
}
@@ -1945,20 +2349,23 @@ escape codes to be printed that alter the behavior of your terminal.
When binary file detection is enabled it is imperfect. In general, it uses
a simple heuristic. If a NUL byte is seen during search, then the file is
considered binary and search stops (unless this flag is present).
Alternatively, if the '--binary' flag is used, then ripgrep will only quit
when it sees a NUL byte after it sees a match (or searches the entire file).
Note that when the `-u/--unrestricted` flag is provided for a third time, then
this flag is automatically enabled.
This flag can be disabled with --no-text.
This flag can be disabled with '--no-text'. It overrides the '--binary' flag.
");
let arg = RGArg::switch("text").short("a")
.help(SHORT).long_help(LONG)
.overrides("no-text");
.overrides("no-text")
.overrides("binary")
.overrides("no-binary");
args.push(arg);
let arg = RGArg::switch("no-text")
.hidden()
.overrides("text");
.overrides("text")
.overrides("binary")
.overrides("no-binary");
args.push(arg);
}
@@ -2087,8 +2494,7 @@ Reduce the level of \"smart\" searching. A single -u won't respect .gitignore
(etc.) files. Two -u flags will additionally search hidden files and
directories. Three -u flags will additionally search binary files.
-uu is roughly equivalent to grep -r and -uuu is roughly equivalent to grep -a
-r.
'rg -uuu' is roughly equivalent to 'grep -r'.
");
let arg = RGArg::switch("unrestricted").short("u")
.help(SHORT).long_help(LONG)
@@ -2130,7 +2536,7 @@ ripgrep is explicitly instructed to search one file or stdin.
This flag overrides --with-filename.
");
let arg = RGArg::switch("no-filename")
let arg = RGArg::switch("no-filename").short("I")
.help(NO_SHORT).long_help(NO_LONG)
.overrides("with-filename");
args.push(arg);

View File

@@ -1,13 +1,15 @@
use std::cmp;
use std::env;
use std::ffi::OsStr;
use std::fs::File;
use std::io::{self, BufRead};
use std::ffi::{OsStr, OsString};
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};
use std::process;
use std::sync::Arc;
use std::time::SystemTime;
use atty;
use clap;
use grep::cli;
use grep::matcher::LineTerminator;
#[cfg(feature = "pcre2")]
use grep::pcre2::{
@@ -19,6 +21,7 @@ use grep::printer::{
JSON, JSONBuilder,
Standard, StandardBuilder,
Summary, SummaryBuilder, SummaryKind,
default_color_specs,
};
use grep::regex::{
RegexMatcher as RustRegexMatcher,
@@ -32,22 +35,22 @@ use ignore::types::{FileTypeDef, Types, TypesBuilder};
use ignore::{Walk, WalkBuilder, WalkParallel};
use log;
use num_cpus;
use path_printer::{PathPrinter, PathPrinterBuilder};
use regex::{self, Regex};
use same_file::Handle;
use regex;
use termcolor::{
WriteColor,
BufferedStandardStream, BufferWriter, ColorChoice, StandardStream,
BufferWriter, ColorChoice,
};
use app;
use config;
use logger::Logger;
use messages::{set_messages, set_ignore_messages};
use search::{PatternMatcher, Printer, SearchWorker, SearchWorkerBuilder};
use subject::SubjectBuilder;
use unescape::{escape, unescape};
use Result;
use crate::app;
use crate::config;
use crate::logger::Logger;
use crate::messages::{set_messages, set_ignore_messages};
use crate::path_printer::{PathPrinter, PathPrinterBuilder};
use crate::search::{
PatternMatcher, Printer, SearchWorker, SearchWorkerBuilder,
};
use crate::subject::SubjectBuilder;
use crate::Result;
/// The command that ripgrep should execute based on the command line
/// configuration.
@@ -70,6 +73,8 @@ pub enum Command {
/// List all file type definitions configured, including the default file
/// types and any additional file types added to the command line.
Types,
/// Print the version of PCRE2 in use.
PCRE2Version,
}
impl Command {
@@ -79,7 +84,11 @@ impl Command {
match *self {
Search | SearchParallel => true,
SearchNever | Files | FilesParallel | Types => false,
| SearchNever
| Files
| FilesParallel
| Types
| PCRE2Version => false,
}
}
}
@@ -128,7 +137,7 @@ impl Args {
// trying to parse config files. If a config file exists and has
// arguments, then we re-parse argv, otherwise we just use the matches
// we have here.
let early_matches = ArgMatches::new(app::app().get_matches());
let early_matches = ArgMatches::new(clap_matches(env::args_os())?);
set_messages(!early_matches.is_present("no-messages"));
set_ignore_messages(!early_matches.is_present("no-ignore-messages"));
@@ -143,7 +152,7 @@ impl Args {
log::set_max_level(log::LevelFilter::Warn);
}
let matches = early_matches.reconfigure();
let matches = early_matches.reconfigure()?;
// The logging level may have changed if we brought in additional
// arguments from a configuration file, so recheck it and set the log
// level as appropriate.
@@ -232,7 +241,9 @@ impl Args {
let threads = self.matches().threads()?;
let one_thread = is_one_search || threads == 1;
Ok(if self.matches().is_present("type-list") {
Ok(if self.matches().is_present("pcre2-version") {
Command::PCRE2Version
} else if self.matches().is_present("type-list") {
Command::Types
} else if self.matches().is_present("files") {
if one_thread {
@@ -265,6 +276,11 @@ impl Args {
Ok(builder.build(wtr))
}
/// Returns true if and only if ripgrep should be "quiet."
pub fn quiet(&self) -> bool {
self.matches().is_present("quiet")
}
/// Returns true if and only if the search should quit after finding the
/// first match.
pub fn quit_after_match(&self) -> Result<bool> {
@@ -278,14 +294,18 @@ impl Args {
&self,
wtr: W,
) -> Result<SearchWorker<W>> {
let matches = self.matches();
let matcher = self.matcher().clone();
let printer = self.printer(wtr)?;
let searcher = self.matches().searcher(self.paths())?;
let searcher = matches.searcher(self.paths())?;
let mut builder = SearchWorkerBuilder::new();
builder
.json_stats(self.matches().is_present("json"))
.preprocessor(self.matches().preprocessor())
.search_zip(self.matches().is_present("search-zip"));
.json_stats(matches.is_present("json"))
.preprocessor(matches.preprocessor())
.preprocessor_globs(matches.preprocessor_globs()?)
.search_zip(matches.is_present("search-zip"))
.binary_detection_implicit(matches.binary_detection_implicit())
.binary_detection_explicit(matches.binary_detection_explicit());
Ok(builder.build(matcher, searcher, printer))
}
@@ -307,20 +327,20 @@ impl Args {
/// file or a stream such as stdin.
pub fn subject_builder(&self) -> SubjectBuilder {
let mut builder = SubjectBuilder::new();
builder
.strip_dot_prefix(self.using_default_path())
.skip(self.matches().stdout_handle());
builder.strip_dot_prefix(self.using_default_path());
builder
}
/// Execute the given function with a writer to stdout that enables color
/// support based on the command line configuration.
pub fn stdout(&self) -> Box<WriteColor + Send> {
let color_choice = self.matches().color_choice();
if atty::is(atty::Stream::Stdout) {
Box::new(StandardStream::stdout(color_choice))
pub fn stdout(&self) -> cli::StandardStream {
let color = self.matches().color_choice();
if self.matches().is_present("line-buffered") {
cli::stdout_buffered_line(color)
} else if self.matches().is_present("block-buffered") {
cli::stdout_buffered_block(color)
} else {
Box::new(BufferedStandardStream::stdout(color_choice))
cli::stdout(color)
}
}
@@ -360,6 +380,151 @@ enum OutputKind {
JSON,
}
/// The sort criteria, if present.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
struct SortBy {
/// Whether to reverse the sort criteria (i.e., descending order).
reverse: bool,
/// The actual sorting criteria.
kind: SortByKind,
}
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
enum SortByKind {
/// No sorting at all.
None,
/// Sort by path.
Path,
/// Sort by last modified time.
LastModified,
/// Sort by last accessed time.
LastAccessed,
/// Sort by creation time.
Created,
}
impl SortBy {
fn asc(kind: SortByKind) -> SortBy {
SortBy { reverse: false, kind: kind }
}
fn desc(kind: SortByKind) -> SortBy {
SortBy { reverse: true, kind: kind }
}
fn none() -> SortBy {
SortBy::asc(SortByKind::None)
}
/// Try to check that the sorting criteria selected is actually supported.
/// If it isn't, then an error is returned.
fn check(&self) -> Result<()> {
match self.kind {
SortByKind::None | SortByKind::Path => {}
SortByKind::LastModified => {
env::current_exe()?.metadata()?.modified()?;
}
SortByKind::LastAccessed => {
env::current_exe()?.metadata()?.accessed()?;
}
SortByKind::Created => {
env::current_exe()?.metadata()?.created()?;
}
}
Ok(())
}
fn configure_walk_builder(self, builder: &mut WalkBuilder) {
// This isn't entirely optimal. In particular, we will wind up issuing
// a stat for many files redundantly. Aside from having potentially
// inconsistent results with respect to sorting, this is also slow.
// We could fix this here at the expense of memory by caching stat
// calls. A better fix would be to find a way to push this down into
// directory traversal itself, but that's a somewhat nasty change.
match self.kind {
SortByKind::None => {}
SortByKind::Path => {
if self.reverse {
builder.sort_by_file_name(|a, b| a.cmp(b).reverse());
} else {
builder.sort_by_file_name(|a, b| a.cmp(b));
}
}
SortByKind::LastModified => {
builder.sort_by_file_path(move |a, b| {
sort_by_metadata_time(
a, b,
self.reverse,
|md| md.modified(),
)
});
}
SortByKind::LastAccessed => {
builder.sort_by_file_path(move |a, b| {
sort_by_metadata_time(
a, b,
self.reverse,
|md| md.accessed(),
)
});
}
SortByKind::Created => {
builder.sort_by_file_path(move |a, b| {
sort_by_metadata_time(
a, b,
self.reverse,
|md| md.created(),
)
});
}
}
}
}
impl SortByKind {
fn new(kind: &str) -> SortByKind {
match kind {
"none" => SortByKind::None,
"path" => SortByKind::Path,
"modified" => SortByKind::LastModified,
"accessed" => SortByKind::LastAccessed,
"created" => SortByKind::Created,
_ => SortByKind::None,
}
}
}
/// Encoding mode the searcher will use.
#[derive(Clone, Debug)]
enum EncodingMode {
/// Use an explicit encoding forcefully, but let BOM sniffing override it.
Some(Encoding),
/// Use only BOM sniffing to auto-detect an encoding.
Auto,
/// Use no explicit encoding and disable all BOM sniffing. This will
/// always result in searching the raw bytes, regardless of their
/// true encoding.
Disabled,
}
impl EncodingMode {
/// Checks if an explicit encoding has been set. Returns false for
/// automatic BOM sniffing and no sniffing.
///
/// This is only used to determine whether PCRE2 needs to have its own
/// UTF-8 checking enabled. If we have an explicit encoding set, then
/// we're always guaranteed to get UTF-8, so we can disable PCRE2's check.
/// Otherwise, we have no such guarantee, and must enable PCRE2' UTF-8
/// check.
#[cfg(feature = "pcre2")]
fn has_explicit_encoding(&self) -> bool {
match self {
EncodingMode::Some(_) => true,
_ => false
}
}
}
impl ArgMatches {
/// Create an ArgMatches from clap's parse result.
fn new(clap_matches: clap::ArgMatches<'static>) -> ArgMatches {
@@ -373,25 +538,27 @@ impl ArgMatches {
///
/// If there are no additional arguments from the environment (e.g., a
/// config file), then the given matches are returned as is.
fn reconfigure(self) -> ArgMatches {
fn reconfigure(self) -> Result<ArgMatches> {
// If the end user says no config, then respect it.
if self.is_present("no-config") {
debug!("not reading config files because --no-config is present");
return self;
log::debug!(
"not reading config files because --no-config is present"
);
return Ok(self);
}
// If the user wants ripgrep to use a config file, then parse args
// from that first.
let mut args = config::args();
if args.is_empty() {
return self;
return Ok(self);
}
let mut cliargs = env::args_os();
if let Some(bin) = cliargs.next() {
args.insert(0, bin);
}
args.extend(cliargs);
debug!("final argv: {:?}", args);
ArgMatches::new(app::app().get_matches_from(args))
log::debug!("final argv: {:?}", args);
Ok(ArgMatches(clap_matches(args)?))
}
/// Convert the result of parsing CLI arguments into ripgrep's higher level
@@ -432,6 +599,25 @@ impl ArgMatches {
if self.is_present("pcre2") {
let matcher = self.matcher_pcre2(patterns)?;
Ok(PatternMatcher::PCRE2(matcher))
} else if self.is_present("auto-hybrid-regex") {
let rust_err = match self.matcher_rust(patterns) {
Ok(matcher) => return Ok(PatternMatcher::RustRegex(matcher)),
Err(err) => err,
};
log::debug!(
"error building Rust regex in hybrid mode:\n{}", rust_err,
);
let pcre_err = match self.matcher_pcre2(patterns) {
Ok(matcher) => return Ok(PatternMatcher::PCRE2(matcher)),
Err(err) => err,
};
Err(From::from(format!(
"regex could not be compiled with either the default regex \
engine or with PCRE2.\n\n\
default regex engine error:\n{}\n{}\n{}\n\n\
PCRE2 regex engine error:\n{}",
"~".repeat(79), rust_err, "~".repeat(79), pcre_err,
)))
} else {
let matcher = match self.matcher_rust(patterns) {
Ok(matcher) => matcher,
@@ -500,7 +686,16 @@ impl ArgMatches {
if let Some(limit) = self.dfa_size_limit()? {
builder.dfa_size_limit(limit);
}
Ok(builder.build(&patterns.join("|"))?)
let res =
if self.is_present("fixed-strings") {
builder.build_literals(patterns)
} else {
builder.build(&patterns.join("|"))
};
match res {
Ok(m) => Ok(m),
Err(err) => Err(From::from(suggest_multiline(err.to_string()))),
}
}
/// Build a matcher using PCRE2.
@@ -515,17 +710,22 @@ impl ArgMatches {
.caseless(self.case_insensitive())
.multi_line(true)
.word(self.is_present("word-regexp"));
// For whatever reason, the JIT craps out during compilation with a
// "no more memory" error on 32 bit systems. So don't use it there.
if !cfg!(target_pointer_width = "32") {
builder.jit(true);
// For whatever reason, the JIT craps out during regex compilation with
// a "no more memory" error on 32 bit systems. So don't use it there.
if cfg!(target_pointer_width = "64") {
builder
.jit_if_available(true)
// The PCRE2 docs say that 32KB is the default, and that 1MB
// should be big enough for anything. But let's crank it to
// 10MB.
.max_jit_stack_size(Some(10 * (1<<20)));
}
if self.pcre2_unicode() {
builder.utf(true).ucp(true);
if self.encoding()?.is_some() {
if self.encoding()?.has_explicit_encoding() {
// SAFETY: If an encoding was specified, then we're guaranteed
// to get valid UTF-8, so we can disable PCRE2's UTF checking.
// (Feeding invalid UTF-8 to PCRE2 is UB.)
// (Feeding invalid UTF-8 to PCRE2 is undefined behavior.)
unsafe {
builder.disable_utf_check();
}
@@ -578,6 +778,7 @@ impl ArgMatches {
.per_match(self.is_present("vimgrep"))
.replacement(self.replacement())
.max_columns(self.max_columns()?)
.max_columns_preview(self.max_columns_preview())
.max_matches(self.max_count()?)
.column(self.column())
.byte_offset(self.is_present("byte-offset"))
@@ -637,9 +838,16 @@ impl ArgMatches {
.before_context(ctx_before)
.after_context(ctx_after)
.passthru(self.is_present("passthru"))
.memory_map(self.mmap_choice(paths))
.binary_detection(self.binary_detection())
.encoding(self.encoding()?);
.memory_map(self.mmap_choice(paths));
match self.encoding()? {
EncodingMode::Some(enc) => {
builder.encoding(Some(enc));
}
EncodingMode::Auto => {} // default for the searcher
EncodingMode::Disabled => {
builder.bom_sniffing(false);
}
}
Ok(builder.build())
}
@@ -663,23 +871,23 @@ impl ArgMatches {
.follow_links(self.is_present("follow"))
.max_filesize(self.max_file_size()?)
.threads(self.threads()?)
.same_file_system(self.is_present("one-file-system"))
.skip_stdout(!self.is_present("files"))
.overrides(self.overrides()?)
.types(self.types()?)
.hidden(!self.hidden())
.parents(!self.no_ignore_parent())
.ignore(!self.no_ignore())
.git_global(
!self.no_ignore()
&& !self.no_ignore_vcs()
&& !self.no_ignore_global())
.git_ignore(!self.no_ignore() && !self.no_ignore_vcs())
.git_exclude(!self.no_ignore() && !self.no_ignore_vcs());
.ignore(!self.no_ignore_dot())
.git_global(!self.no_ignore_vcs() && !self.no_ignore_global())
.git_ignore(!self.no_ignore_vcs())
.git_exclude(!self.no_ignore_vcs())
.ignore_case_insensitive(self.ignore_file_case_insensitive());
if !self.no_ignore() {
builder.add_custom_ignore_filename(".rgignore");
}
if self.is_present("sort-files") {
builder.sort_by_file_name(|a, b| a.cmp(b));
}
let sortby = self.sort_by()?;
sortby.check()?;
sortby.configure_walk_builder(&mut builder);
Ok(builder)
}
}
@@ -689,16 +897,39 @@ impl ArgMatches {
///
/// Methods are sorted alphabetically.
impl ArgMatches {
/// Returns the form of binary detection to perform.
fn binary_detection(&self) -> BinaryDetection {
/// Returns the form of binary detection to perform on files that are
/// implicitly searched via recursive directory traversal.
fn binary_detection_implicit(&self) -> BinaryDetection {
let none =
self.is_present("text")
|| self.is_present("null-data");
let convert =
self.is_present("binary")
|| self.unrestricted_count() >= 3;
if none {
BinaryDetection::none()
} else if convert {
BinaryDetection::convert(b'\x00')
} else {
BinaryDetection::quit(b'\x00')
}
}
/// Returns the form of binary detection to perform on files that are
/// explicitly searched via the user invoking ripgrep on a particular
/// file or files or stdin.
///
/// In general, this should never be BinaryDetection::quit, since that acts
/// as a filter (but quitting immediately once a NUL byte is seen), and we
/// should never filter out files that the user wants to explicitly search.
fn binary_detection_explicit(&self) -> BinaryDetection {
let none =
self.is_present("text")
|| self.unrestricted_count() >= 3
|| self.is_present("null-data");
if none {
BinaryDetection::none()
} else {
BinaryDetection::quit(b'\x00')
BinaryDetection::convert(b'\x00')
}
}
@@ -738,7 +969,7 @@ impl ArgMatches {
} else if preference == "ansi" {
ColorChoice::AlwaysAnsi
} else if preference == "auto" {
if atty::is(atty::Stream::Stdout) || self.is_present("pretty") {
if cli::is_tty_stdout() || self.is_present("pretty") {
ColorChoice::Auto
} else {
ColorChoice::Never
@@ -754,15 +985,7 @@ impl ArgMatches {
/// is returned.
fn color_specs(&self) -> Result<ColorSpecs> {
// Start with a default set of color specs.
let mut specs = vec![
#[cfg(unix)]
"path:fg:magenta".parse().unwrap(),
#[cfg(windows)]
"path:fg:cyan".parse().unwrap(),
"line:fg:green".parse().unwrap(),
"match:fg:red".parse().unwrap(),
"match:style:bold".parse().unwrap(),
];
let mut specs = default_color_specs();
for spec_str in self.values_of_lossy_vec("colors") {
specs.push(spec_str.parse()?);
}
@@ -798,9 +1021,9 @@ impl ArgMatches {
///
/// If one was not provided, the default `--` is returned.
fn context_separator(&self) -> Vec<u8> {
match self.value_of_lossy("context-separator") {
match self.value_of_os("context-separator") {
None => b"--".to_vec(),
Some(sep) => unescape(&sep),
Some(sep) => cli::unescape_os(&sep),
}
}
@@ -832,24 +1055,30 @@ impl ArgMatches {
u64_to_usize("dfa-size-limit", r)
}
/// Returns the type of encoding to use.
/// Returns the encoding mode to use.
///
/// This only returns an encoding if one is explicitly specified. When no
/// encoding is present, the Searcher will still do BOM sniffing for UTF-16
/// and transcode seamlessly.
fn encoding(&self) -> Result<Option<Encoding>> {
/// This only returns an encoding if one is explicitly specified. Otherwise
/// if set to automatic, the Searcher will do BOM sniffing for UTF-16
/// and transcode seamlessly. If disabled, no BOM sniffing nor transcoding
/// will occur.
fn encoding(&self) -> Result<EncodingMode> {
if self.is_present("no-encoding") {
return Ok(None);
return Ok(EncodingMode::Auto);
}
let label = match self.value_of_lossy("encoding") {
None if self.pcre2_unicode() => "utf-8".to_string(),
None => return Ok(None),
None => return Ok(EncodingMode::Auto),
Some(label) => label,
};
if label == "auto" {
return Ok(None);
return Ok(EncodingMode::Auto);
} else if label == "none" {
return Ok(EncodingMode::Disabled);
}
Ok(Some(Encoding::new(&label)?))
Ok(EncodingMode::Some(Encoding::new(&label)?))
}
/// Return the file separator to use based on the CLI configuration.
@@ -875,7 +1104,7 @@ impl ArgMatches {
if self.is_present("no-heading") || self.is_present("vimgrep") {
false
} else {
atty::is(atty::Stream::Stdout)
cli::is_tty_stdout()
|| self.is_present("heading")
|| self.is_present("pretty")
}
@@ -887,6 +1116,11 @@ impl ArgMatches {
self.is_present("hidden") || self.unrestricted_count() >= 2
}
/// Returns true if ignore files should be processed case insensitively.
fn ignore_file_case_insensitive(&self) -> bool {
self.is_present("ignore-file-case-insensitive")
}
/// Return all of the ignore file paths given on the command line.
fn ignore_paths(&self) -> Vec<PathBuf> {
let paths = match self.values_of_os("ignore-file") {
@@ -927,7 +1161,7 @@ impl ArgMatches {
// generally want to show line numbers by default when printing to a
// tty for human consumption, except for one interesting case: when
// we're only searching stdin. This makes pipelines work as expected.
(atty::is(atty::Stream::Stdout) && !self.is_only_stdin(paths))
(cli::is_tty_stdout() && !self.is_only_stdin(paths))
|| self.is_present("line-number")
|| self.is_present("column")
|| self.is_present("pretty")
@@ -941,6 +1175,12 @@ impl ArgMatches {
Ok(self.usize_of_nonzero("max-columns")?.map(|n| n as u64))
}
/// Returns true if and only if a preview should be shown for lines that
/// exceed the maximum column limit.
fn max_columns_preview(&self) -> bool {
self.is_present("max-columns-preview")
}
/// The maximum number of matches permitted.
fn max_count(&self) -> Result<Option<u64>> {
Ok(self.usize_of("max-count")?.map(|n| n as u64))
@@ -981,6 +1221,11 @@ impl ArgMatches {
self.is_present("no-ignore") || self.unrestricted_count() >= 1
}
/// Returns true if .ignore files should be ignored.
fn no_ignore_dot(&self) -> bool {
self.is_present("no-ignore-dot") || self.no_ignore()
}
/// Returns true if global ignore files should be ignored.
fn no_ignore_global(&self) -> bool {
self.is_present("no-ignore-global") || self.no_ignore()
@@ -1027,7 +1272,7 @@ impl ArgMatches {
builder.add(&glob)?;
}
// This only enables case insensitivity for subsequent globs.
builder.case_insensitive(true)?;
builder.case_insensitive(true).unwrap();
for glob in self.values_of_lossy_vec("iglob") {
builder.add(&glob)?;
}
@@ -1062,11 +1307,11 @@ impl ArgMatches {
let file_is_stdin = self.values_of_os("file")
.map_or(false, |mut files| files.any(|f| f == "-"));
let search_cwd =
atty::is(atty::Stream::Stdin)
|| !stdin_is_readable()
!cli::is_readable_stdin()
|| (self.is_present("file") && file_is_stdin)
|| self.is_present("files")
|| self.is_present("type-list");
|| self.is_present("type-list")
|| self.is_present("pcre2-version");
if search_cwd {
Path::new("./").to_path_buf()
} else {
@@ -1079,9 +1324,9 @@ impl ArgMatches {
/// If the provided path separator is more than a single byte, then an
/// error is returned.
fn path_separator(&self) -> Result<Option<u8>> {
let sep = match self.value_of_lossy("path-separator") {
let sep = match self.value_of_os("path-separator") {
None => return Ok(None),
Some(sep) => unescape(&sep),
Some(sep) => cli::unescape_os(&sep),
};
if sep.is_empty() {
Ok(None)
@@ -1092,7 +1337,7 @@ impl ArgMatches {
In some shells on Windows '/' is automatically \
expanded. Use '//' instead.",
sep.len(),
escape(&sep),
cli::escape(&sep),
)))
} else {
Ok(Some(sep[0]))
@@ -1139,18 +1384,18 @@ impl ArgMatches {
}
}
}
if let Some(files) = self.values_of_os("file") {
for file in files {
if file == "-" {
let stdin = io::stdin();
for line in stdin.lock().lines() {
pats.push(self.pattern_from_str(&line?));
}
if let Some(paths) = self.values_of_os("file") {
for path in paths {
if path == "-" {
pats.extend(cli::patterns_from_stdin()?
.into_iter()
.map(|p| self.pattern_from_string(p))
);
} else {
let f = File::open(file)?;
for line in io::BufReader::new(f).lines() {
pats.push(self.pattern_from_str(&line?));
}
pats.extend(cli::patterns_from_path(path)?
.into_iter()
.map(|p| self.pattern_from_string(p))
);
}
}
}
@@ -1172,20 +1417,24 @@ impl ArgMatches {
///
/// If the pattern is not valid UTF-8, then an error is returned.
fn pattern_from_os_str(&self, pat: &OsStr) -> Result<String> {
let s = pattern_to_str(pat)?;
let s = cli::pattern_from_os(pat)?;
Ok(self.pattern_from_str(s))
}
/// Converts a &str pattern to a String pattern. The pattern is escaped
/// if -F/--fixed-strings is set.
fn pattern_from_str(&self, pat: &str) -> String {
let litpat = self.pattern_literal(pat.to_string());
let s = self.pattern_line(litpat);
self.pattern_from_string(pat.to_string())
}
if s.is_empty() {
/// Applies additional processing on the given pattern if necessary
/// (such as escaping meta characters or turning it into a line regex).
fn pattern_from_string(&self, pat: String) -> String {
let pat = self.pattern_line(self.pattern_literal(pat));
if pat.is_empty() {
self.pattern_empty()
} else {
s
pat
}
}
@@ -1222,6 +1471,17 @@ impl ArgMatches {
Some(Path::new(path).to_path_buf())
}
/// Builds the set of globs for filtering files to apply to the --pre
/// flag. If no --pre-globs are available, then this always returns an
/// empty set of globs.
fn preprocessor_globs(&self) -> Result<Override> {
let mut builder = OverrideBuilder::new(env::current_dir()?);
for glob in self.values_of_lossy_vec("pre-glob") {
builder.add(&glob)?;
}
Ok(builder.build()?)
}
/// Parse the regex-size-limit argument option into a byte count.
fn regex_size_limit(&self) -> Result<Option<usize>> {
let r = self.parse_human_readable_size("regex-size-limit")?;
@@ -1233,6 +1493,22 @@ impl ArgMatches {
self.value_of_lossy("replace").map(|s| s.into_bytes())
}
/// Returns the sorting criteria based on command line parameters.
fn sort_by(&self) -> Result<SortBy> {
// For backcompat, continue supporting deprecated --sort-files flag.
if self.is_present("sort-files") {
return Ok(SortBy::asc(SortByKind::Path));
}
let sortby = match self.value_of_lossy("sort") {
None => match self.value_of_lossy("sortr") {
None => return Ok(SortBy::none()),
Some(choice) => SortBy::desc(SortByKind::new(&choice)),
}
Some(choice) => SortBy::asc(SortByKind::new(&choice)),
};
Ok(sortby)
}
/// Returns true if and only if aggregate statistics for a search should
/// be tracked.
///
@@ -1243,28 +1519,6 @@ impl ArgMatches {
self.output_kind() == OutputKind::JSON || self.is_present("stats")
}
/// Returns a handle to stdout for filtering search.
///
/// A handle is returned if and only if ripgrep's stdout is being
/// redirected to a file. The handle returned corresponds to that file.
///
/// This can be used to ensure that we do not attempt to search a file
/// that ripgrep is writing to.
fn stdout_handle(&self) -> Option<Handle> {
let h = match Handle::stdout() {
Err(_) => return None,
Ok(h) => h,
};
let md = match h.as_file().metadata() {
Err(_) => return None,
Ok(md) => md,
};
if !md.is_file() {
return None;
}
Some(h)
}
/// When the output format is `Summary`, this returns the type of summary
/// output to show.
///
@@ -1288,7 +1542,7 @@ impl ArgMatches {
/// Return the number of threads that should be used for parallelism.
fn threads(&self) -> Result<usize> {
if self.is_present("sort-files") {
if self.sort_by()?.kind != SortByKind::None {
return Ok(1);
}
let threads = self.usize_of("threads")?.unwrap_or(0);
@@ -1386,40 +1640,11 @@ impl ArgMatches {
&self,
arg_name: &str,
) -> Result<Option<u64>> {
lazy_static! {
static ref RE: Regex = Regex::new(r"^([0-9]+)([KMG])?$").unwrap();
}
let arg_value = match self.value_of_lossy(arg_name) {
Some(x) => x,
None => return Ok(None)
let size = match self.value_of_lossy(arg_name) {
None => return Ok(None),
Some(size) => size,
};
let caps = RE
.captures(&arg_value)
.ok_or_else(|| {
format!("invalid format for {}", arg_name)
})?;
let value = caps[1].parse::<u64>()?;
let suffix = caps.get(2).map(|x| x.as_str());
let v_10 = value.checked_mul(1024);
let v_20 = v_10.and_then(|x| x.checked_mul(1024));
let v_30 = v_20.and_then(|x| x.checked_mul(1024));
let try_suffix = |x: Option<u64>| {
if x.is_some() {
Ok(x)
} else {
Err(From::from(format!("number too large for {}", arg_name)))
}
};
match suffix {
None => Ok(Some(value)),
Some("K") => try_suffix(v_10),
Some("M") => try_suffix(v_20),
Some("G") => try_suffix(v_30),
_ => Err(From::from(format!("invalid suffix for {}", arg_name)))
}
Ok(Some(cli::parse_human_readable_size(&size)?))
}
}
@@ -1453,21 +1678,6 @@ impl ArgMatches {
}
}
/// Convert an OsStr to a Unicode string.
///
/// Patterns _must_ be valid UTF-8, so if the given OsStr isn't valid UTF-8,
/// this returns an error.
fn pattern_to_str(s: &OsStr) -> Result<&str> {
s.to_str().ok_or_else(|| {
From::from(format!(
"Argument '{}' is not valid UTF-8. \
Use hex escape sequences to match arbitrary \
bytes in a pattern (e.g., \\xFF).",
s.to_string_lossy()
))
})
}
/// Inspect an error resulting from building a Rust regex matcher, and if it's
/// believed to correspond to a syntax error that PCRE2 could handle, then
/// add a message to suggest the use of -P/--pcre2.
@@ -1483,6 +1693,17 @@ and look-around.", msg)
}
}
fn suggest_multiline(msg: String) -> String {
if msg.contains("the literal") && msg.contains("not allowed") {
format!("{}
Consider enabling multiline mode with the --multiline flag (or -U for short).
When multiline mode is enabled, new line characters can be matched.", msg)
} else {
msg
}
}
/// Convert the result of parsing a human readable file size to a `usize`,
/// failing if the type does not fit.
fn u64_to_usize(
@@ -1502,32 +1723,59 @@ fn u64_to_usize(
}
}
/// Returns true if and only if stdin is deemed searchable.
#[cfg(unix)]
fn stdin_is_readable() -> bool {
use std::os::unix::fs::FileTypeExt;
let ft = match Handle::stdin().and_then(|h| h.as_file().metadata()) {
Err(_) => return false,
Ok(md) => md.file_type(),
/// Builds a comparator for sorting two files according to a system time
/// extracted from the file's metadata.
///
/// If there was a problem extracting the metadata or if the time is not
/// available, then both entries compare equal.
fn sort_by_metadata_time<G>(
p1: &Path,
p2: &Path,
reverse: bool,
get_time: G,
) -> cmp::Ordering
where G: Fn(&fs::Metadata) -> io::Result<SystemTime>
{
let t1 = match p1.metadata().and_then(|md| get_time(&md)) {
Ok(t) => t,
Err(_) => return cmp::Ordering::Equal,
};
ft.is_file() || ft.is_fifo()
let t2 = match p2.metadata().and_then(|md| get_time(&md)) {
Ok(t) => t,
Err(_) => return cmp::Ordering::Equal,
};
if reverse {
t1.cmp(&t2).reverse()
} else {
t1.cmp(&t2)
}
}
/// Returns true if and only if stdin is deemed searchable.
#[cfg(windows)]
fn stdin_is_readable() -> bool {
use std::os::windows::io::AsRawHandle;
use winapi::um::fileapi::GetFileType;
use winapi::um::winbase::{FILE_TYPE_DISK, FILE_TYPE_PIPE};
let handle = match Handle::stdin() {
Err(_) => return false,
Ok(handle) => handle,
/// Returns a clap matches object if the given arguments parse successfully.
///
/// Otherwise, if an error occurred, then it is returned unless the error
/// corresponds to a `--help` or `--version` request. In which case, the
/// corresponding output is printed and the current process is exited
/// successfully.
fn clap_matches<I, T>(
args: I,
) -> Result<clap::ArgMatches<'static>>
where I: IntoIterator<Item=T>,
T: Into<OsString> + Clone
{
let err = match app::app().get_matches_from_safe(args) {
Ok(matches) => return Ok(matches),
Err(err) => err,
};
let raw_handle = handle.as_raw_handle();
// SAFETY: As far as I can tell, it's not possible to use GetFileType in
// a way that violates safety. We give it a handle and we get an integer.
let ft = unsafe { GetFileType(raw_handle) };
ft == FILE_TYPE_DISK || ft == FILE_TYPE_PIPE
if err.use_stderr() {
return Err(err.into());
}
// Explicitly ignore any error returned by write!. The most likely error
// at this point is a broken pipe error, in which case, we want to ignore
// it and exit quietly.
//
// (This is the point of this helper function. clap's functionality for
// doing this will panic on a broken pipe error.)
let _ = write!(io::stdout(), "{}", err);
process::exit(0);
}

View File

@@ -5,11 +5,14 @@
use std::env;
use std::error::Error;
use std::fs::File;
use std::io::{self, BufRead};
use std::io;
use std::ffi::OsString;
use std::path::{Path, PathBuf};
use Result;
use bstr::{io::BufReadExt, ByteSlice};
use log;
use crate::Result;
/// Return a sequence of arguments derived from ripgrep rc configuration files.
pub fn args() -> Vec<OsString> {
@@ -34,7 +37,7 @@ pub fn args() -> Vec<OsString> {
message!("{}:{}", config_path.display(), err);
}
}
debug!(
log::debug!(
"{}: arguments loaded from config file: {:?}",
config_path.display(),
args
@@ -52,7 +55,7 @@ pub fn args() -> Vec<OsString> {
/// for each line in addition to successfully parsed arguments.
fn parse<P: AsRef<Path>>(
path: P,
) -> Result<(Vec<OsString>, Vec<Box<Error>>)> {
) -> Result<(Vec<OsString>, Vec<Box<dyn Error>>)> {
let path = path.as_ref();
match File::open(&path) {
Ok(file) => parse_reader(file),
@@ -73,63 +76,30 @@ fn parse<P: AsRef<Path>>(
/// in addition to successfully parsed arguments.
fn parse_reader<R: io::Read>(
rdr: R,
) -> Result<(Vec<OsString>, Vec<Box<Error>>)> {
let mut bufrdr = io::BufReader::new(rdr);
) -> Result<(Vec<OsString>, Vec<Box<dyn Error>>)> {
let bufrdr = io::BufReader::new(rdr);
let (mut args, mut errs) = (vec![], vec![]);
let mut line = vec![];
let mut line_number = 0;
while {
line.clear();
bufrdr.for_byte_line_with_terminator(|line| {
line_number += 1;
bufrdr.read_until(b'\n', &mut line)? > 0
} {
trim(&mut line);
let line = line.trim();
if line.is_empty() || line[0] == b'#' {
continue;
return Ok(true);
}
match bytes_to_os_string(&line) {
match line.to_os_str() {
Ok(osstr) => {
args.push(osstr);
args.push(osstr.to_os_string());
}
Err(err) => {
errs.push(format!("{}: {}", line_number, err).into());
}
}
}
Ok(true)
})?;
Ok((args, errs))
}
/// Trim the given bytes of whitespace according to the ASCII definition.
fn trim(x: &mut Vec<u8>) {
let upto = x.iter().take_while(|b| is_space(**b)).count();
x.drain(..upto);
let revto = x.len() - x.iter().rev().take_while(|b| is_space(**b)).count();
x.drain(revto..);
}
/// Returns true if and only if the given byte is an ASCII space character.
fn is_space(b: u8) -> bool {
b == b'\t'
|| b == b'\n'
|| b == b'\x0B'
|| b == b'\x0C'
|| b == b'\r'
|| b == b' '
}
/// On Unix, get an OsString from raw bytes.
#[cfg(unix)]
fn bytes_to_os_string(bytes: &[u8]) -> Result<OsString> {
use std::os::unix::ffi::OsStringExt;
Ok(OsString::from_vec(bytes.to_vec()))
}
/// On non-Unix (like Windows), require UTF-8.
#[cfg(not(unix))]
fn bytes_to_os_string(bytes: &[u8]) -> Result<OsString> {
String::from_utf8(bytes.to_vec()).map(OsString::from).map_err(From::from)
}
#[cfg(test)]
mod tests {
use std::ffi::OsString;

View File

@@ -1,190 +0,0 @@
use std::collections::HashMap;
use std::ffi::OsStr;
use std::fmt;
use std::io::{self, Read};
use std::path::Path;
use std::process::{self, Stdio};
use globset::{Glob, GlobSet, GlobSetBuilder};
/// A decompression command, contains the command to be spawned as well as any
/// necessary CLI args.
#[derive(Clone, Copy, Debug)]
struct DecompressionCommand {
cmd: &'static str,
args: &'static [&'static str],
}
impl DecompressionCommand {
/// Create a new decompress command
fn new(
cmd: &'static str,
args: &'static [&'static str],
) -> DecompressionCommand {
DecompressionCommand {
cmd, args
}
}
}
impl fmt::Display for DecompressionCommand {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{} {}", self.cmd, self.args.join(" "))
}
}
lazy_static! {
static ref DECOMPRESSION_COMMANDS: HashMap<
&'static str,
DecompressionCommand,
> = {
let mut m = HashMap::new();
const ARGS: &[&str] = &["-d", "-c"];
m.insert("gz", DecompressionCommand::new("gzip", ARGS));
m.insert("bz2", DecompressionCommand::new("bzip2", ARGS));
m.insert("xz", DecompressionCommand::new("xz", ARGS));
m.insert("lz4", DecompressionCommand::new("lz4", ARGS));
const LZMA_ARGS: &[&str] = &["--format=lzma", "-d", "-c"];
m.insert("lzma", DecompressionCommand::new("xz", LZMA_ARGS));
m
};
static ref SUPPORTED_COMPRESSION_FORMATS: GlobSet = {
let mut builder = GlobSetBuilder::new();
builder.add(Glob::new("*.gz").unwrap());
builder.add(Glob::new("*.bz2").unwrap());
builder.add(Glob::new("*.xz").unwrap());
builder.add(Glob::new("*.lz4").unwrap());
builder.add(Glob::new("*.lzma").unwrap());
builder.build().unwrap()
};
static ref TAR_ARCHIVE_FORMATS: GlobSet = {
let mut builder = GlobSetBuilder::new();
builder.add(Glob::new("*.tar.gz").unwrap());
builder.add(Glob::new("*.tar.xz").unwrap());
builder.add(Glob::new("*.tar.bz2").unwrap());
builder.add(Glob::new("*.tar.lz4").unwrap());
builder.add(Glob::new("*.tgz").unwrap());
builder.add(Glob::new("*.txz").unwrap());
builder.add(Glob::new("*.tbz2").unwrap());
builder.build().unwrap()
};
}
/// DecompressionReader provides an `io::Read` implementation for a limited
/// set of compression formats.
#[derive(Debug)]
pub struct DecompressionReader {
cmd: DecompressionCommand,
child: process::Child,
done: bool,
}
impl DecompressionReader {
/// Returns a handle to the stdout of the spawned decompression process for
/// `path`, which can be directly searched in the worker. When the returned
/// value is exhausted, the underlying process is reaped. If the underlying
/// process fails, then its stderr is read and converted into a normal
/// io::Error.
///
/// If there is any error in spawning the decompression command, then
/// return `None`, after outputting any necessary debug or error messages.
pub fn from_path(path: &Path) -> Option<DecompressionReader> {
let extension = match path.extension().and_then(OsStr::to_str) {
Some(extension) => extension,
None => {
debug!(
"{}: failed to get compresson extension", path.display());
return None;
}
};
let decompression_cmd = match DECOMPRESSION_COMMANDS.get(extension) {
Some(cmd) => cmd,
None => {
debug!(
"{}: failed to get decompression command", path.display());
return None;
}
};
let cmd = process::Command::new(decompression_cmd.cmd)
.args(decompression_cmd.args)
.arg(path)
.stdout(Stdio::piped())
.stderr(Stdio::piped())
.spawn();
let child = match cmd {
Ok(process) => process,
Err(_) => {
debug!(
"{}: decompression command '{}' not found",
path.display(), decompression_cmd.cmd);
return None;
}
};
Some(DecompressionReader::new(*decompression_cmd, child))
}
fn new(
cmd: DecompressionCommand,
child: process::Child,
) -> DecompressionReader {
DecompressionReader {
cmd: cmd,
child: child,
done: false,
}
}
fn read_error(&mut self) -> io::Result<io::Error> {
let mut errbytes = vec![];
self.child.stderr.as_mut().unwrap().read_to_end(&mut errbytes)?;
let errstr = String::from_utf8_lossy(&errbytes);
let errstr = errstr.trim();
Ok(if errstr.is_empty() {
let msg = format!("decompression command failed: '{}'", self.cmd);
io::Error::new(io::ErrorKind::Other, msg)
} else {
let msg = format!(
"decompression command '{}' failed: {}", self.cmd, errstr);
io::Error::new(io::ErrorKind::Other, msg)
})
}
}
impl io::Read for DecompressionReader {
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
if self.done {
return Ok(0);
}
let nread = self.child.stdout.as_mut().unwrap().read(buf)?;
if nread == 0 {
self.done = true;
// Reap the child now that we're done reading.
// If the command failed, report stderr as an error.
if !self.child.wait()?.success() {
return Err(self.read_error()?);
}
}
Ok(nread)
}
}
/// Returns true if the given path contains a supported compression format or
/// is a TAR archive.
pub fn is_compressed(path: &Path) -> bool {
is_supported_compression_format(path) || is_tar_archive(path)
}
/// Returns true if the given path matches any one of the supported compression
/// formats
fn is_supported_compression_format(path: &Path) -> bool {
SUPPORTED_COMPRESSION_FORMATS.is_match(path)
}
/// Returns true if the given path matches any of the known TAR file formats.
fn is_tar_archive(path: &Path) -> bool {
TAR_ARCHIVE_FORMATS.is_match(path)
}

View File

@@ -1,23 +1,5 @@
extern crate atty;
#[macro_use]
extern crate clap;
extern crate globset;
extern crate grep;
extern crate ignore;
#[macro_use]
extern crate lazy_static;
#[macro_use]
extern crate log;
extern crate num_cpus;
extern crate regex;
extern crate same_file;
#[macro_use]
extern crate serde_json;
extern crate termcolor;
#[cfg(windows)]
extern crate winapi;
use std::io;
use std::error;
use std::io::{self, Write};
use std::process;
use std::sync::{Arc, Mutex};
use std::time::Instant;
@@ -33,44 +15,69 @@ mod messages;
mod app;
mod args;
mod config;
mod decompressor;
mod preprocessor;
mod logger;
mod path_printer;
mod search;
mod subject;
mod unescape;
type Result<T> = ::std::result::Result<T, Box<::std::error::Error>>;
// Since Rust no longer uses jemalloc by default, ripgrep will, by default,
// use the system allocator. On Linux, this would normally be glibc's
// allocator, which is pretty good. In particular, ripgrep does not have a
// particularly allocation heavy workload, so there really isn't much
// difference (for ripgrep's purposes) between glibc's allocator and jemalloc.
//
// However, when ripgrep is built with musl, this means ripgrep will use musl's
// allocator, which appears to be substantially worse. (musl's goal is not to
// have the fastest version of everything. Its goal is to be small and amenable
// to static compilation.) Even though ripgrep isn't particularly allocation
// heavy, musl's allocator appears to slow down ripgrep quite a bit. Therefore,
// when building with musl, we use jemalloc.
//
// We don't unconditionally use jemalloc because it can be nice to use the
// system's default allocator by default. Moreover, jemalloc seems to increase
// compilation times by a bit.
//
// Moreover, we only do this on 64-bit systems since jemalloc doesn't support
// i686.
#[cfg(all(target_env = "musl", target_pointer_width = "64"))]
#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;
type Result<T> = ::std::result::Result<T, Box<dyn error::Error>>;
fn main() {
match Args::parse().and_then(try_main) {
Ok(true) => process::exit(0),
Ok(false) => process::exit(1),
Err(err) => {
eprintln!("{}", err);
process::exit(2);
}
if let Err(err) = Args::parse().and_then(try_main) {
eprintln!("{}", err);
process::exit(2);
}
}
fn try_main(args: Args) -> Result<bool> {
fn try_main(args: Args) -> Result<()> {
use args::Command::*;
match args.command()? {
Search => search(args),
SearchParallel => search_parallel(args),
SearchNever => Ok(false),
Files => files(args),
FilesParallel => files_parallel(args),
Types => types(args),
let matched =
match args.command()? {
Search => search(&args),
SearchParallel => search_parallel(&args),
SearchNever => Ok(false),
Files => files(&args),
FilesParallel => files_parallel(&args),
Types => types(&args),
PCRE2Version => pcre2_version(&args),
}?;
if matched && (args.quiet() || !messages::errored()) {
process::exit(0)
} else if messages::errored() {
process::exit(2)
} else {
process::exit(1)
}
}
/// The top-level entry point for single-threaded search. This recursively
/// steps through the file list (current directory by default) and searches
/// each file sequentially.
fn search(args: Args) -> Result<bool> {
fn search(args: &Args) -> Result<bool> {
let started_at = Instant::now();
let quit_after_match = args.quit_after_match()?;
let subject_builder = args.subject_builder();
@@ -90,7 +97,7 @@ fn search(args: Args) -> Result<bool> {
if err.kind() == io::ErrorKind::BrokenPipe {
break;
}
message!("{}: {}", subject.path().display(), err);
err_message!("{}: {}", subject.path().display(), err);
continue;
}
};
@@ -113,7 +120,7 @@ fn search(args: Args) -> Result<bool> {
/// The top-level entry point for multi-threaded search. The parallelism is
/// itself achieved by the recursive directory traversal. All we need to do is
/// feed it a worker for performing a search on each file.
fn search_parallel(args: Args) -> Result<bool> {
fn search_parallel(args: &Args) -> Result<bool> {
use std::sync::atomic::AtomicBool;
use std::sync::atomic::Ordering::SeqCst;
@@ -149,7 +156,7 @@ fn search_parallel(args: Args) -> Result<bool> {
let search_result = match searcher.search(&subject) {
Ok(search_result) => search_result,
Err(err) => {
message!("{}: {}", subject.path().display(), err);
err_message!("{}: {}", subject.path().display(), err);
return WalkState::Continue;
}
};
@@ -166,7 +173,7 @@ fn search_parallel(args: Args) -> Result<bool> {
return WalkState::Quit;
}
// Otherwise, we continue on our merry way.
message!("{}: {}", subject.path().display(), err);
err_message!("{}: {}", subject.path().display(), err);
}
if matched.load(SeqCst) && quit_after_match {
WalkState::Quit
@@ -191,7 +198,7 @@ fn search_parallel(args: Args) -> Result<bool> {
/// The top-level entry point for listing files without searching them. This
/// recursively steps through the file list (current directory by default) and
/// prints each path sequentially using a single thread.
fn files(args: Args) -> Result<bool> {
fn files(args: &Args) -> Result<bool> {
let quit_after_match = args.quit_after_match()?;
let subject_builder = args.subject_builder();
let mut matched = false;
@@ -221,7 +228,7 @@ fn files(args: Args) -> Result<bool> {
/// The top-level entry point for listing files without searching them. This
/// recursively steps through the file list (current directory by default) and
/// prints each path sequentially using multiple threads.
fn files_parallel(args: Args) -> Result<bool> {
fn files_parallel(args: &Args) -> Result<bool> {
use std::sync::atomic::AtomicBool;
use std::sync::atomic::Ordering::SeqCst;
use std::sync::mpsc;
@@ -273,7 +280,7 @@ fn files_parallel(args: Args) -> Result<bool> {
}
/// The top-level entry point for --type-list.
fn types(args: Args) -> Result<bool> {
fn types(args: &Args) -> Result<bool> {
let mut count = 0;
let mut stdout = args.stdout();
for def in args.type_defs()? {
@@ -293,3 +300,30 @@ fn types(args: Args) -> Result<bool> {
}
Ok(count > 0)
}
/// The top-level entry point for --pcre2-version.
fn pcre2_version(args: &Args) -> Result<bool> {
#[cfg(feature = "pcre2")]
fn imp(args: &Args) -> Result<bool> {
use grep::pcre2;
let mut stdout = args.stdout();
let (major, minor) = pcre2::version();
writeln!(stdout, "PCRE2 {}.{} is available", major, minor)?;
if cfg!(target_pointer_width = "64") && pcre2::is_jit_available() {
writeln!(stdout, "JIT is available")?;
}
Ok(true)
}
#[cfg(not(feature = "pcre2"))]
fn imp(args: &Args) -> Result<bool> {
let mut stdout = args.stdout();
writeln!(stdout, "PCRE2 is not available in this build of ripgrep.")?;
Ok(false)
}
imp(args)
}

View File

@@ -1,21 +1,35 @@
use std::sync::atomic::{ATOMIC_BOOL_INIT, AtomicBool, Ordering};
use std::sync::atomic::{AtomicBool, Ordering};
static MESSAGES: AtomicBool = ATOMIC_BOOL_INIT;
static IGNORE_MESSAGES: AtomicBool = ATOMIC_BOOL_INIT;
static MESSAGES: AtomicBool = AtomicBool::new(false);
static IGNORE_MESSAGES: AtomicBool = AtomicBool::new(false);
static ERRORED: AtomicBool = AtomicBool::new(false);
/// Emit a non-fatal error message, unless messages were disabled.
#[macro_export]
macro_rules! message {
($($tt:tt)*) => {
if ::messages::messages() {
if crate::messages::messages() {
eprintln!($($tt)*);
}
}
}
/// Like message, but sets ripgrep's "errored" flag, which controls the exit
/// status.
#[macro_export]
macro_rules! err_message {
($($tt:tt)*) => {
crate::messages::set_errored();
message!($($tt)*);
}
}
/// Emit a non-fatal ignore-related error message (like a parse error), unless
/// ignore-messages were disabled.
#[macro_export]
macro_rules! ignore_message {
($($tt:tt)*) => {
if ::messages::messages() && ::messages::ignore_messages() {
if crate::messages::messages() && crate::messages::ignore_messages() {
eprintln!($($tt)*);
}
}
@@ -48,3 +62,13 @@ pub fn ignore_messages() -> bool {
pub fn set_ignore_messages(yes: bool) {
IGNORE_MESSAGES.store(yes, Ordering::SeqCst)
}
/// Returns true if and only if ripgrep came across a non-fatal error.
pub fn errored() -> bool {
ERRORED.load(Ordering::SeqCst)
}
/// Indicate that ripgrep has come across a non-fatal error.
pub fn set_errored() {
ERRORED.store(true, Ordering::SeqCst);
}

View File

@@ -1,93 +0,0 @@
use std::fs::File;
use std::io::{self, Read};
use std::path::{Path, PathBuf};
use std::process::{self, Stdio};
/// PreprocessorReader provides an `io::Read` impl to read kids output.
#[derive(Debug)]
pub struct PreprocessorReader {
cmd: PathBuf,
path: PathBuf,
child: process::Child,
done: bool,
}
impl PreprocessorReader {
/// Returns a handle to the stdout of the spawned preprocessor process for
/// `path`, which can be directly searched in the worker. When the returned
/// value is exhausted, the underlying process is reaped. If the underlying
/// process fails, then its stderr is read and converted into a normal
/// io::Error.
///
/// If there is any error in spawning the preprocessor command, then
/// return the corresponding error.
pub fn from_cmd_path(
cmd: PathBuf,
path: &Path,
) -> io::Result<PreprocessorReader> {
let child = process::Command::new(&cmd)
.arg(path)
.stdin(Stdio::from(File::open(path)?))
.stdout(Stdio::piped())
.stderr(Stdio::piped())
.spawn()
.map_err(|err| {
io::Error::new(
io::ErrorKind::Other,
format!(
"error running preprocessor command '{}': {}",
cmd.display(),
err,
),
)
})?;
Ok(PreprocessorReader {
cmd: cmd,
path: path.to_path_buf(),
child: child,
done: false,
})
}
fn read_error(&mut self) -> io::Result<io::Error> {
let mut errbytes = vec![];
self.child.stderr.as_mut().unwrap().read_to_end(&mut errbytes)?;
let errstr = String::from_utf8_lossy(&errbytes);
let errstr = errstr.trim();
Ok(if errstr.is_empty() {
let msg = format!(
"preprocessor command failed: '{} {}'",
self.cmd.display(),
self.path.display(),
);
io::Error::new(io::ErrorKind::Other, msg)
} else {
let msg = format!(
"preprocessor command failed: '{} {}': {}",
self.cmd.display(),
self.path.display(),
errstr,
);
io::Error::new(io::ErrorKind::Other, msg)
})
}
}
impl io::Read for PreprocessorReader {
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
if self.done {
return Ok(0);
}
let nread = self.child.stdout.as_mut().unwrap().read(buf)?;
if nread == 0 {
self.done = true;
// Reap the child now that we're done reading.
// If the command failed, report stderr as an error.
if !self.child.wait()?.success() {
return Err(self.read_error()?);
}
}
Ok(nread)
}
}

View File

@@ -1,19 +1,22 @@
use std::fs::File;
use std::io;
use std::path::{Path, PathBuf};
use std::process::{Command, Stdio};
use std::time::Duration;
use grep::cli;
use grep::matcher::Matcher;
#[cfg(feature = "pcre2")]
use grep::pcre2::{RegexMatcher as PCRE2RegexMatcher};
use grep::printer::{JSON, Standard, Summary, Stats};
use grep::regex::{RegexMatcher as RustRegexMatcher};
use grep::searcher::Searcher;
use grep::searcher::{BinaryDetection, Searcher};
use ignore::overrides::Override;
use serde_json as json;
use serde_json::json;
use termcolor::WriteColor;
use decompressor::{DecompressionReader, is_compressed};
use preprocessor::PreprocessorReader;
use subject::Subject;
use crate::subject::Subject;
/// The configuration for the search worker. Among a few other things, the
/// configuration primarily controls the way we show search results to users
@@ -22,7 +25,10 @@ use subject::Subject;
struct Config {
json_stats: bool,
preprocessor: Option<PathBuf>,
preprocessor_globs: Override,
search_zip: bool,
binary_implicit: BinaryDetection,
binary_explicit: BinaryDetection,
}
impl Default for Config {
@@ -30,7 +36,10 @@ impl Default for Config {
Config {
json_stats: false,
preprocessor: None,
preprocessor_globs: Override::empty(),
search_zip: false,
binary_implicit: BinaryDetection::none(),
binary_explicit: BinaryDetection::none(),
}
}
}
@@ -39,6 +48,8 @@ impl Default for Config {
#[derive(Clone, Debug)]
pub struct SearchWorkerBuilder {
config: Config,
command_builder: cli::CommandReaderBuilder,
decomp_builder: cli::DecompressionReaderBuilder,
}
impl Default for SearchWorkerBuilder {
@@ -50,7 +61,17 @@ impl Default for SearchWorkerBuilder {
impl SearchWorkerBuilder {
/// Create a new builder for configuring and constructing a search worker.
pub fn new() -> SearchWorkerBuilder {
SearchWorkerBuilder { config: Config::default() }
let mut cmd_builder = cli::CommandReaderBuilder::new();
cmd_builder.async_stderr(true);
let mut decomp_builder = cli::DecompressionReaderBuilder::new();
decomp_builder.async_stderr(true);
SearchWorkerBuilder {
config: Config::default(),
command_builder: cmd_builder,
decomp_builder: decomp_builder,
}
}
/// Create a new search worker using the given searcher, matcher and
@@ -62,7 +83,12 @@ impl SearchWorkerBuilder {
printer: Printer<W>,
) -> SearchWorker<W> {
let config = self.config.clone();
SearchWorker { config, matcher, searcher, printer }
let command_builder = self.command_builder.clone();
let decomp_builder = self.decomp_builder.clone();
SearchWorker {
config, command_builder, decomp_builder,
matcher, searcher, printer,
}
}
/// Forcefully use JSON to emit statistics, even if the underlying printer
@@ -90,6 +116,17 @@ impl SearchWorkerBuilder {
self
}
/// Set the globs for determining which files should be run through the
/// preprocessor. By default, with no globs and a preprocessor specified,
/// every file is run through the preprocessor.
pub fn preprocessor_globs(
&mut self,
globs: Override,
) -> &mut SearchWorkerBuilder {
self.config.preprocessor_globs = globs;
self
}
/// Enable the decompression and searching of common compressed files.
///
/// When enabled, if a particular file path is recognized as a compressed
@@ -101,6 +138,37 @@ impl SearchWorkerBuilder {
self.config.search_zip = yes;
self
}
/// Set the binary detection that should be used when searching files
/// found via a recursive directory search.
///
/// Generally, this binary detection may be `BinaryDetection::quit` if
/// we want to skip binary files completely.
///
/// By default, no binary detection is performed.
pub fn binary_detection_implicit(
&mut self,
detection: BinaryDetection,
) -> &mut SearchWorkerBuilder {
self.config.binary_implicit = detection;
self
}
/// Set the binary detection that should be used when searching files
/// explicitly supplied by an end user.
///
/// Generally, this binary detection should NOT be `BinaryDetection::quit`,
/// since we never want to automatically filter files supplied by the end
/// user.
///
/// By default, no binary detection is performed.
pub fn binary_detection_explicit(
&mut self,
detection: BinaryDetection,
) -> &mut SearchWorkerBuilder {
self.config.binary_explicit = detection;
self
}
}
/// The result of executing a search.
@@ -237,6 +305,8 @@ impl<W: WriteColor> Printer<W> {
#[derive(Debug)]
pub struct SearchWorker<W> {
config: Config,
command_builder: cli::CommandReaderBuilder,
decomp_builder: cli::DecompressionReaderBuilder,
matcher: PatternMatcher,
searcher: Searcher,
printer: Printer<W>,
@@ -245,7 +315,24 @@ pub struct SearchWorker<W> {
impl<W: WriteColor> SearchWorker<W> {
/// Execute a search over the given subject.
pub fn search(&mut self, subject: &Subject) -> io::Result<SearchResult> {
self.search_impl(subject)
let bin =
if subject.is_explicit() {
self.config.binary_explicit.clone()
} else {
self.config.binary_implicit.clone()
};
self.searcher.set_binary_detection(bin);
let path = subject.path();
if subject.is_stdin() {
self.search_reader(path, io::stdin().lock())
} else if self.should_preprocess(path) {
self.search_preprocessor(path)
} else if self.should_decompress(path) {
self.search_decompress(path)
} else {
self.search_path(path)
}
}
/// Return a mutable reference to the underlying printer.
@@ -271,25 +358,67 @@ impl<W: WriteColor> SearchWorker<W> {
}
}
/// Search the given subject using the appropriate strategy.
fn search_impl(&mut self, subject: &Subject) -> io::Result<SearchResult> {
let path = subject.path();
if subject.is_stdin() {
let stdin = io::stdin();
// A `return` here appeases the borrow checker. NLL will fix this.
return self.search_reader(path, stdin.lock());
} else if self.config.preprocessor.is_some() {
let cmd = self.config.preprocessor.clone().unwrap();
let rdr = PreprocessorReader::from_cmd_path(cmd, path)?;
self.search_reader(path, rdr)
} else if self.config.search_zip && is_compressed(path) {
match DecompressionReader::from_path(path) {
None => Ok(SearchResult::default()),
Some(rdr) => self.search_reader(path, rdr),
}
} else {
self.search_path(path)
/// Returns true if and only if the given file path should be
/// decompressed before searching.
fn should_decompress(&self, path: &Path) -> bool {
if !self.config.search_zip {
return false;
}
self.decomp_builder.get_matcher().has_command(path)
}
/// Returns true if and only if the given file path should be run through
/// the preprocessor.
fn should_preprocess(&self, path: &Path) -> bool {
if !self.config.preprocessor.is_some() {
return false;
}
if self.config.preprocessor_globs.is_empty() {
return true;
}
!self.config.preprocessor_globs.matched(path, false).is_ignore()
}
/// Search the given file path by first asking the preprocessor for the
/// data to search instead of opening the path directly.
fn search_preprocessor(
&mut self,
path: &Path,
) -> io::Result<SearchResult> {
let bin = self.config.preprocessor.as_ref().unwrap();
let mut cmd = Command::new(bin);
cmd.arg(path).stdin(Stdio::from(File::open(path)?));
let rdr = self
.command_builder
.build(&mut cmd)
.map_err(|err| {
io::Error::new(
io::ErrorKind::Other,
format!(
"preprocessor command could not start: '{:?}': {}",
cmd,
err,
),
)
})?;
self.search_reader(path, rdr).map_err(|err| {
io::Error::new(
io::ErrorKind::Other,
format!("preprocessor command failed: '{:?}': {}", cmd, err),
)
})
}
/// Attempt to decompress the data at the given file path and search the
/// result. If the given file path isn't recognized as a compressed file,
/// then search it without doing any decompression.
fn search_decompress(
&mut self,
path: &Path,
) -> io::Result<SearchResult> {
let rdr = self.decomp_builder.build(path)?;
self.search_reader(path, rdr)
}
/// Search the contents of the given file path.

View File

@@ -1,26 +1,18 @@
use std::io;
use std::path::Path;
use std::sync::Arc;
use ignore::{self, DirEntry};
use same_file::Handle;
use log;
/// A configuration for describing how subjects should be built.
#[derive(Clone, Debug)]
struct Config {
skip: Option<Arc<Handle>>,
strip_dot_prefix: bool,
separator: Option<u8>,
terminator: Option<u8>,
}
impl Default for Config {
fn default() -> Config {
Config {
skip: None,
strip_dot_prefix: false,
separator: None,
terminator: None,
}
}
}
@@ -49,7 +41,7 @@ impl SubjectBuilder {
match result {
Ok(dent) => self.build(dent),
Err(err) => {
message!("{}", err);
err_message!("{}", err);
None
}
}
@@ -67,37 +59,12 @@ impl SubjectBuilder {
if let Some(ignore_err) = subj.dent.error() {
ignore_message!("{}", ignore_err);
}
// If this entry represents stdin, then we always search it.
if subj.dent.is_stdin() {
// If this entry was explicitly provided by an end user, then we always
// want to search it.
if subj.is_explicit() {
return Some(subj);
}
// If we're supposed to skip a particular file, then skip it.
if let Some(ref handle) = self.config.skip {
match subj.equals(handle) {
Ok(false) => {} // fallthrough
Ok(true) => {
debug!(
"ignoring {}: (probably same file as stdout)",
subj.dent.path().display()
);
return None;
}
Err(err) => {
debug!(
"ignoring {}: got error: {}",
subj.dent.path().display(), err
);
return None;
}
}
}
// If this subject has a depth of 0, then it was provided explicitly
// by an end user (or via a shell glob). In this case, we always want
// to search it if it even smells like a file (e.g., a symlink).
if subj.dent.depth() == 0 && !subj.is_dir() {
return Some(subj);
}
// At this point, we only want to search something it's explicitly a
// At this point, we only want to search something if it's explicitly a
// file. This omits symlinks. (If ripgrep was configured to follow
// symlinks, then they have already been followed by the directory
// traversal.)
@@ -108,7 +75,7 @@ impl SubjectBuilder {
// directory. Otherwise, emitting messages for directories is just
// noisy.
if !subj.is_dir() {
debug!(
log::debug!(
"ignoring {}: failed to pass subject filter: \
file type: {:?}, metadata: {:?}",
subj.dent.path().display(),
@@ -119,22 +86,6 @@ impl SubjectBuilder {
None
}
/// When provided, subjects that represent the same file as the handle
/// given will be skipped.
///
/// Typically, it is useful to pass a handle referring to stdout, such
/// that the file being written to isn't searched, which can lead to
/// an unbounded feedback mechanism.
///
/// Only one handle to skip can be provided.
pub fn skip(
&mut self,
handle: Option<Handle>,
) -> &mut SubjectBuilder {
self.config.skip = handle.map(Arc::new);
self
}
/// When enabled, if the subject's file path starts with `./` then it is
/// stripped.
///
@@ -171,60 +122,43 @@ impl Subject {
self.dent.is_stdin()
}
/// Returns true if and only if this subject points to a directory.
/// Returns true if and only if this entry corresponds to a subject to
/// search that was explicitly supplied by an end user.
///
/// This works around a bug in Rust's standard library:
/// https://github.com/rust-lang/rust/issues/46484
#[cfg(windows)]
fn is_dir(&self) -> bool {
use std::os::windows::fs::MetadataExt;
use winapi::um::winnt::FILE_ATTRIBUTE_DIRECTORY;
self.dent.metadata().map(|md| {
md.file_attributes() & FILE_ATTRIBUTE_DIRECTORY != 0
}).unwrap_or(false)
/// Generally, this corresponds to either stdin or an explicit file path
/// argument. e.g., in `rg foo some-file ./some-dir/`, `some-file` is
/// an explicit subject, but, e.g., `./some-dir/some-other-file` is not.
///
/// However, note that ripgrep does not see through shell globbing. e.g.,
/// in `rg foo ./some-dir/*`, `./some-dir/some-other-file` will be treated
/// as an explicit subject.
pub fn is_explicit(&self) -> bool {
// stdin is obvious. When an entry has a depth of 0, that means it
// was explicitly provided to our directory iterator, which means it
// was in turn explicitly provided by the end user. The !is_dir check
// means that we want to search files even if their symlinks, again,
// because they were explicitly provided. (And we never want to try
// to search a directory.)
self.is_stdin() || (self.dent.depth() == 0 && !self.is_dir())
}
/// Returns true if and only if this subject points to a directory.
#[cfg(not(windows))]
/// Returns true if and only if this subject points to a directory after
/// following symbolic links.
fn is_dir(&self) -> bool {
self.dent.file_type().map_or(false, |ft| ft.is_dir())
let ft = match self.dent.file_type() {
None => return false,
Some(ft) => ft,
};
if ft.is_dir() {
return true;
}
// If this is a symlink, then we want to follow it to determine
// whether it's a directory or not.
self.dent.path_is_symlink() && self.dent.path().is_dir()
}
/// Returns true if and only if this subject points to a file.
///
/// This works around a bug in Rust's standard library:
/// https://github.com/rust-lang/rust/issues/46484
#[cfg(windows)]
fn is_file(&self) -> bool {
!self.is_dir()
}
/// Returns true if and only if this subject points to a file.
#[cfg(not(windows))]
fn is_file(&self) -> bool {
self.dent.file_type().map_or(false, |ft| ft.is_file())
}
/// Returns true if and only if this subject is believed to be equivalent
/// to the given handle. If there was a problem querying this subject for
/// information to determine equality, then that error is returned.
fn equals(&self, handle: &Handle) -> io::Result<bool> {
#[cfg(unix)]
fn never_equal(dent: &DirEntry, handle: &Handle) -> bool {
dent.ino() != Some(handle.ino())
}
#[cfg(not(unix))]
fn never_equal(_: &DirEntry, _: &Handle) -> bool {
false
}
// If we know for sure that these two things aren't equal, then avoid
// the costly extra stat call to determine equality.
if self.dent.is_stdin() || never_equal(&self.dent, handle) {
return Ok(false);
}
Handle::from_path(self.path()).map(|h| &h == handle)
}
}

View File

@@ -1,137 +0,0 @@
/// A single state in the state machine used by `unescape`.
#[derive(Clone, Copy, Eq, PartialEq)]
enum State {
/// The state after seeing a `\`.
Escape,
/// The state after seeing a `\x`.
HexFirst,
/// The state after seeing a `\x[0-9A-Fa-f]`.
HexSecond(char),
/// Default state.
Literal,
}
/// Escapes an arbitrary byte slice such that it can be presented as a human
/// readable string.
pub fn escape(bytes: &[u8]) -> String {
use std::ascii::escape_default;
let escaped = bytes.iter().flat_map(|&b| escape_default(b)).collect();
String::from_utf8(escaped).unwrap()
}
/// Unescapes a string given on the command line. It supports a limited set of
/// escape sequences:
///
/// * `\t`, `\r` and `\n` are mapped to their corresponding ASCII bytes.
/// * `\xZZ` hexadecimal escapes are mapped to their byte.
pub fn unescape(s: &str) -> Vec<u8> {
use self::State::*;
let mut bytes = vec![];
let mut state = Literal;
for c in s.chars() {
match state {
Escape => {
match c {
'n' => { bytes.push(b'\n'); state = Literal; }
'r' => { bytes.push(b'\r'); state = Literal; }
't' => { bytes.push(b'\t'); state = Literal; }
'x' => { state = HexFirst; }
c => {
bytes.extend(format!(r"\{}", c).into_bytes());
state = Literal;
}
}
}
HexFirst => {
match c {
'0'...'9' | 'A'...'F' | 'a'...'f' => {
state = HexSecond(c);
}
c => {
bytes.extend(format!(r"\x{}", c).into_bytes());
state = Literal;
}
}
}
HexSecond(first) => {
match c {
'0'...'9' | 'A'...'F' | 'a'...'f' => {
let ordinal = format!("{}{}", first, c);
let byte = u8::from_str_radix(&ordinal, 16).unwrap();
bytes.push(byte);
state = Literal;
}
c => {
let original = format!(r"\x{}{}", first, c);
bytes.extend(original.into_bytes());
state = Literal;
}
}
}
Literal => {
match c {
'\\' => { state = Escape; }
c => { bytes.extend(c.to_string().as_bytes()); }
}
}
}
}
match state {
Escape => bytes.push(b'\\'),
HexFirst => bytes.extend(b"\\x"),
HexSecond(c) => bytes.extend(format!("\\x{}", c).into_bytes()),
Literal => {}
}
bytes
}
#[cfg(test)]
mod tests {
use super::unescape;
fn b(bytes: &'static [u8]) -> Vec<u8> {
bytes.to_vec()
}
#[test]
fn unescape_nul() {
assert_eq!(b(b"\x00"), unescape(r"\x00"));
}
#[test]
fn unescape_nl() {
assert_eq!(b(b"\n"), unescape(r"\n"));
}
#[test]
fn unescape_tab() {
assert_eq!(b(b"\t"), unescape(r"\t"));
}
#[test]
fn unescape_carriage() {
assert_eq!(b(b"\r"), unescape(r"\r"));
}
#[test]
fn unescape_nothing_simple() {
assert_eq!(b(b"\\a"), unescape(r"\a"));
}
#[test]
fn unescape_nothing_hex0() {
assert_eq!(b(b"\\x"), unescape(r"\x"));
}
#[test]
fn unescape_nothing_hex1() {
assert_eq!(b(b"\\xz"), unescape(r"\xz"));
}
#[test]
fn unescape_nothing_hex2() {
assert_eq!(b(b"\\xzz"), unescape(r"\xzz"));
}
}

315
tests/binary.rs Normal file
View File

@@ -0,0 +1,315 @@
use crate::util::{Dir, TestCommand};
// This file contains a smattering of tests specifically for checking ripgrep's
// handling of binary files. There's quite a bit of discussion on this in this
// bug report: https://github.com/BurntSushi/ripgrep/issues/306
// Our haystack is the first 500 lines of Gutenberg's copy of "A Study in
// Scarlet," with a NUL byte at line 237: `abcdef\x00`.
//
// The position and size of the haystack is, unfortunately, significant. In
// particular, the NUL byte is specifically inserted at some point *after* the
// first 8192 bytes, which corresponds to the initial capacity of the buffer
// that ripgrep uses to read files. (grep for DEFAULT_BUFFER_CAPACITY.) The
// position of the NUL byte ensures that we can execute some search on the
// initial buffer contents without ever detecting any binary data. Moreover,
// when using a memory map for searching, only the first 8192 bytes are
// scanned for a NUL byte, so no binary bytes are detected at all when using
// a memory map (unless our query matches line 237).
//
// One last note: in the tests below, we use --no-mmap heavily because binary
// detection with memory maps is a bit different. Namely, NUL bytes are only
// searched for in the first few KB of the file and in a match. Normally, NUL
// bytes are searched for everywhere.
//
// TODO: Add tests for binary file detection when using memory maps.
const HAY: &'static [u8] = include_bytes!("./data/sherlock-nul.txt");
// This tests that ripgrep prints a warning message if it finds and prints a
// match in a binary file before detecting that it is a binary file. The point
// here is to notify that user that the search of the file is only partially
// complete.
//
// This applies to files that are *implicitly* searched via a recursive
// directory traversal. In particular, this results in a WARNING message being
// printed. We make our file "implicit" by doing a recursive search with a glob
// that matches our file.
rgtest!(after_match1_implicit, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n", "Project Gutenberg EBook", "-g", "hay",
]);
let expected = "\
hay:1:The Project Gutenberg EBook of A Study In Scarlet, by Arthur Conan Doyle
WARNING: stopped searching binary file hay after match (found \"\\u{0}\" byte around offset 9741)
";
eqnice!(expected, cmd.stdout());
});
// Like after_match1_implicit, except we provide a file to search
// explicitly. This results in identical behavior, but a different message.
rgtest!(after_match1_explicit, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n", "Project Gutenberg EBook", "hay",
]);
let expected = "\
1:The Project Gutenberg EBook of A Study In Scarlet, by Arthur Conan Doyle
Binary file matches (found \"\\u{0}\" byte around offset 9741)
";
eqnice!(expected, cmd.stdout());
});
// Like after_match1_explicit, except we feed our content on stdin.
rgtest!(after_match1_stdin, |_: Dir, mut cmd: TestCommand| {
cmd.args(&[
"--no-mmap", "-n", "Project Gutenberg EBook",
]);
let expected = "\
1:The Project Gutenberg EBook of A Study In Scarlet, by Arthur Conan Doyle
Binary file matches (found \"\\u{0}\" byte around offset 9741)
";
eqnice!(expected, cmd.pipe(HAY));
});
// Like after_match1_implicit, but provides the --binary flag, which
// disables binary filtering. Thus, this matches the behavior of ripgrep as
// if the file were given explicitly.
rgtest!(after_match1_implicit_binary, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n", "--binary", "Project Gutenberg EBook", "-g", "hay",
]);
let expected = "\
hay:1:The Project Gutenberg EBook of A Study In Scarlet, by Arthur Conan Doyle
Binary file hay matches (found \"\\u{0}\" byte around offset 9741)
";
eqnice!(expected, cmd.stdout());
});
// Like after_match1_implicit, but enables -a/--text, so no binary
// detection should be performed.
rgtest!(after_match1_implicit_text, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n", "--text", "Project Gutenberg EBook", "-g", "hay",
]);
let expected = "\
hay:1:The Project Gutenberg EBook of A Study In Scarlet, by Arthur Conan Doyle
";
eqnice!(expected, cmd.stdout());
});
// Like after_match1_implicit_text, but enables -a/--text, so no binary
// detection should be performed.
rgtest!(after_match1_explicit_text, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n", "--text", "Project Gutenberg EBook", "hay",
]);
let expected = "\
1:The Project Gutenberg EBook of A Study In Scarlet, by Arthur Conan Doyle
";
eqnice!(expected, cmd.stdout());
});
// Like after_match1_implicit, except this asks ripgrep to print all matching
// files.
//
// This is an interesting corner case that one might consider a bug, however,
// it's unlikely to be fixed. Namely, ripgrep probably shouldn't print `hay`
// as a matching file since it is in fact a binary file, and thus should be
// filtered out by default. However, the --files-with-matches flag will print
// out the path of a matching file as soon as a match is seen and then stop
// searching completely. Therefore, the NUL byte is never actually detected.
//
// The only way to fix this would be to kill ripgrep's performance in this case
// and continue searching the entire file for a NUL byte. (Similarly if the
// --quiet flag is set. See the next test.)
rgtest!(after_match1_implicit_path, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-l", "Project Gutenberg EBook", "-g", "hay",
]);
eqnice!("hay\n", cmd.stdout());
});
// Like after_match1_implicit_path, except this indicates that a match was
// found with no other output. (This is the same bug described above, but
// manifest as an exit code with no output.)
rgtest!(after_match1_implicit_quiet, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-q", "Project Gutenberg EBook", "-g", "hay",
]);
eqnice!("", cmd.stdout());
});
// This sets up the same test as after_match1_implicit_path, but instead of
// just printing the matching files, this includes the full count of matches.
// In this case, we need to search the entire file, so ripgrep correctly
// detects the binary data and suppresses output.
rgtest!(after_match1_implicit_count, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-c", "Project Gutenberg EBook", "-g", "hay",
]);
cmd.assert_err();
});
// Like after_match1_implicit_count, except the --binary flag is provided,
// which makes ripgrep disable binary data filtering even for implicit files.
rgtest!(after_match1_implicit_count_binary, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-c", "--binary",
"Project Gutenberg EBook",
"-g", "hay",
]);
eqnice!("hay:1\n", cmd.stdout());
});
// Like after_match1_implicit_count, except the file path is provided
// explicitly, so binary filtering is disabled and a count is correctly
// reported.
rgtest!(after_match1_explicit_count, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-c", "Project Gutenberg EBook", "hay",
]);
eqnice!("1\n", cmd.stdout());
});
// This tests that a match way before the NUL byte is shown, but a match after
// the NUL byte is not.
rgtest!(after_match2_implicit, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n",
"Project Gutenberg EBook|a medical student",
"-g", "hay",
]);
let expected = "\
hay:1:The Project Gutenberg EBook of A Study In Scarlet, by Arthur Conan Doyle
WARNING: stopped searching binary file hay after match (found \"\\u{0}\" byte around offset 9741)
";
eqnice!(expected, cmd.stdout());
});
// Like after_match2_implicit, but enables -a/--text, so no binary
// detection should be performed.
rgtest!(after_match2_implicit_text, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n", "--text",
"Project Gutenberg EBook|a medical student",
"-g", "hay",
]);
let expected = "\
hay:1:The Project Gutenberg EBook of A Study In Scarlet, by Arthur Conan Doyle
hay:236:\"And yet you say he is not a medical student?\"
";
eqnice!(expected, cmd.stdout());
});
// This tests that ripgrep *silently* quits before finding a match that occurs
// after a NUL byte.
rgtest!(before_match1_implicit, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n", "Heaven", "-g", "hay",
]);
cmd.assert_err();
});
// This tests that ripgrep *does not* silently quit before finding a match that
// occurs after a NUL byte when a file is explicitly searched.
rgtest!(before_match1_explicit, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n", "Heaven", "hay",
]);
let expected = "\
Binary file matches (found \"\\u{0}\" byte around offset 9741)
";
eqnice!(expected, cmd.stdout());
});
// Like before_match1_implicit, but enables the --binary flag, which
// disables binary filtering. Thus, this matches the behavior of ripgrep as if
// the file were given explicitly.
rgtest!(before_match1_implicit_binary, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n", "--binary", "Heaven", "-g", "hay",
]);
let expected = "\
Binary file hay matches (found \"\\u{0}\" byte around offset 9741)
";
eqnice!(expected, cmd.stdout());
});
// Like before_match1_implicit, but enables -a/--text, so no binary
// detection should be performed.
rgtest!(before_match1_implicit_text, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n", "--text", "Heaven", "-g", "hay",
]);
let expected = "\
hay:238:\"No. Heaven knows what the objects of his studies are. But here we
";
eqnice!(expected, cmd.stdout());
});
// This tests that ripgrep *silently* quits before finding a match that occurs
// before a NUL byte, but within the same buffer as the NUL byte.
rgtest!(before_match2_implicit, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n", "a medical student", "-g", "hay",
]);
cmd.assert_err();
});
// This tests that ripgrep *does not* silently quit before finding a match that
// occurs before a NUL byte, but within the same buffer as the NUL byte. Even
// though the match occurs before the NUL byte, ripgrep still doesn't print it
// because it has already scanned ahead to detect the NUL byte. (This matches
// the behavior of GNU grep.)
rgtest!(before_match2_explicit, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n", "a medical student", "hay",
]);
let expected = "\
Binary file matches (found \"\\u{0}\" byte around offset 9741)
";
eqnice!(expected, cmd.stdout());
});
// Like before_match1_implicit, but enables -a/--text, so no binary
// detection should be performed.
rgtest!(before_match2_implicit_text, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes("hay", HAY);
cmd.args(&[
"--no-mmap", "-n", "--text", "a medical student", "-g", "hay",
]);
let expected = "\
hay:236:\"And yet you say he is not a medical student?\"
";
eqnice!(expected, cmd.stdout());
});

500
tests/data/sherlock-nul.txt Normal file
View File

@@ -0,0 +1,500 @@
The Project Gutenberg EBook of A Study In Scarlet, by Arthur Conan Doyle
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
Title: A Study In Scarlet
Author: Arthur Conan Doyle
Posting Date: July 12, 2008 [EBook #244]
Release Date: April, 1995
[Last updated: February 17, 2013]
Language: English
*** START OF THIS PROJECT GUTENBERG EBOOK A STUDY IN SCARLET ***
Produced by Roger Squires
A STUDY IN SCARLET.
By A. Conan Doyle
[1]
Original Transcriber's Note: This etext is prepared directly
from an 1887 edition, and care has been taken to duplicate the
original exactly, including typographical and punctuation
vagaries.
Additions to the text include adding the underscore character to
indicate italics, and textual end-notes in square braces.
Project Gutenberg Editor's Note: In reproofing and moving old PG
files such as this to the present PG directory system it is the
policy to reformat the text to conform to present PG Standards.
In this case however, in consideration of the note above of the
original transcriber describing his care to try to duplicate the
original 1887 edition as to typography and punctuation vagaries,
no changes have been made in this ascii text file. However, in
the Latin-1 file and this html file, present standards are
followed and the several French and Spanish words have been
given their proper accents.
Part II, The Country of the Saints, deals much with the Mormon Church.
A STUDY IN SCARLET.
PART I.
(_Being a reprint from the reminiscences of_ JOHN H. WATSON, M.D., _late
of the Army Medical Department._) [2]
CHAPTER I. MR. SHERLOCK HOLMES.
IN the year 1878 I took my degree of Doctor of Medicine of the
University of London, and proceeded to Netley to go through the course
prescribed for surgeons in the army. Having completed my studies there,
I was duly attached to the Fifth Northumberland Fusiliers as Assistant
Surgeon. The regiment was stationed in India at the time, and before
I could join it, the second Afghan war had broken out. On landing at
Bombay, I learned that my corps had advanced through the passes, and
was already deep in the enemy's country. I followed, however, with many
other officers who were in the same situation as myself, and succeeded
in reaching Candahar in safety, where I found my regiment, and at once
entered upon my new duties.
The campaign brought honours and promotion to many, but for me it had
nothing but misfortune and disaster. I was removed from my brigade and
attached to the Berkshires, with whom I served at the fatal battle of
Maiwand. There I was struck on the shoulder by a Jezail bullet, which
shattered the bone and grazed the subclavian artery. I should have
fallen into the hands of the murderous Ghazis had it not been for the
devotion and courage shown by Murray, my orderly, who threw me across a
pack-horse, and succeeded in bringing me safely to the British lines.
Worn with pain, and weak from the prolonged hardships which I had
undergone, I was removed, with a great train of wounded sufferers, to
the base hospital at Peshawar. Here I rallied, and had already improved
so far as to be able to walk about the wards, and even to bask a little
upon the verandah, when I was struck down by enteric fever, that curse
of our Indian possessions. For months my life was despaired of, and
when at last I came to myself and became convalescent, I was so weak and
emaciated that a medical board determined that not a day should be lost
in sending me back to England. I was dispatched, accordingly, in the
troopship "Orontes," and landed a month later on Portsmouth jetty, with
my health irretrievably ruined, but with permission from a paternal
government to spend the next nine months in attempting to improve it.
I had neither kith nor kin in England, and was therefore as free as
air--or as free as an income of eleven shillings and sixpence a day will
permit a man to be. Under such circumstances, I naturally gravitated to
London, that great cesspool into which all the loungers and idlers of
the Empire are irresistibly drained. There I stayed for some time at
a private hotel in the Strand, leading a comfortless, meaningless
existence, and spending such money as I had, considerably more freely
than I ought. So alarming did the state of my finances become, that
I soon realized that I must either leave the metropolis and rusticate
somewhere in the country, or that I must make a complete alteration in
my style of living. Choosing the latter alternative, I began by making
up my mind to leave the hotel, and to take up my quarters in some less
pretentious and less expensive domicile.
On the very day that I had come to this conclusion, I was standing at
the Criterion Bar, when some one tapped me on the shoulder, and turning
round I recognized young Stamford, who had been a dresser under me at
Barts. The sight of a friendly face in the great wilderness of London is
a pleasant thing indeed to a lonely man. In old days Stamford had never
been a particular crony of mine, but now I hailed him with enthusiasm,
and he, in his turn, appeared to be delighted to see me. In the
exuberance of my joy, I asked him to lunch with me at the Holborn, and
we started off together in a hansom.
"Whatever have you been doing with yourself, Watson?" he asked in
undisguised wonder, as we rattled through the crowded London streets.
"You are as thin as a lath and as brown as a nut."
I gave him a short sketch of my adventures, and had hardly concluded it
by the time that we reached our destination.
"Poor devil!" he said, commiseratingly, after he had listened to my
misfortunes. "What are you up to now?"
"Looking for lodgings." [3] I answered. "Trying to solve the problem
as to whether it is possible to get comfortable rooms at a reasonable
price."
"That's a strange thing," remarked my companion; "you are the second man
to-day that has used that expression to me."
"And who was the first?" I asked.
"A fellow who is working at the chemical laboratory up at the hospital.
He was bemoaning himself this morning because he could not get someone
to go halves with him in some nice rooms which he had found, and which
were too much for his purse."
"By Jove!" I cried, "if he really wants someone to share the rooms and
the expense, I am the very man for him. I should prefer having a partner
to being alone."
Young Stamford looked rather strangely at me over his wine-glass. "You
don't know Sherlock Holmes yet," he said; "perhaps you would not care
for him as a constant companion."
"Why, what is there against him?"
"Oh, I didn't say there was anything against him. He is a little queer
in his ideas--an enthusiast in some branches of science. As far as I
know he is a decent fellow enough."
"A medical student, I suppose?" said I.
"No--I have no idea what he intends to go in for. I believe he is well
up in anatomy, and he is a first-class chemist; but, as far as I know,
he has never taken out any systematic medical classes. His studies are
very desultory and eccentric, but he has amassed a lot of out-of-the way
knowledge which would astonish his professors."
"Did you never ask him what he was going in for?" I asked.
"No; he is not a man that it is easy to draw out, though he can be
communicative enough when the fancy seizes him."
"I should like to meet him," I said. "If I am to lodge with anyone, I
should prefer a man of studious and quiet habits. I am not strong
enough yet to stand much noise or excitement. I had enough of both in
Afghanistan to last me for the remainder of my natural existence. How
could I meet this friend of yours?"
"He is sure to be at the laboratory," returned my companion. "He either
avoids the place for weeks, or else he works there from morning to
night. If you like, we shall drive round together after luncheon."
"Certainly," I answered, and the conversation drifted away into other
channels.
As we made our way to the hospital after leaving the Holborn, Stamford
gave me a few more particulars about the gentleman whom I proposed to
take as a fellow-lodger.
"You mustn't blame me if you don't get on with him," he said; "I know
nothing more of him than I have learned from meeting him occasionally in
the laboratory. You proposed this arrangement, so you must not hold me
responsible."
"If we don't get on it will be easy to part company," I answered. "It
seems to me, Stamford," I added, looking hard at my companion, "that you
have some reason for washing your hands of the matter. Is this fellow's
temper so formidable, or what is it? Don't be mealy-mouthed about it."
"It is not easy to express the inexpressible," he answered with a laugh.
"Holmes is a little too scientific for my tastes--it approaches to
cold-bloodedness. I could imagine his giving a friend a little pinch of
the latest vegetable alkaloid, not out of malevolence, you understand,
but simply out of a spirit of inquiry in order to have an accurate idea
of the effects. To do him justice, I think that he would take it himself
with the same readiness. He appears to have a passion for definite and
exact knowledge."
"Very right too."
"Yes, but it may be pushed to excess. When it comes to beating the
subjects in the dissecting-rooms with a stick, it is certainly taking
rather a bizarre shape."
"Beating the subjects!"
"Yes, to verify how far bruises may be produced after death. I saw him
at it with my own eyes."
"And yet you say he is not a medical student?"
abcdef
"No. Heaven knows what the objects of his studies are. But here we
are, and you must form your own impressions about him." As he spoke, we
turned down a narrow lane and passed through a small side-door, which
opened into a wing of the great hospital. It was familiar ground to me,
and I needed no guiding as we ascended the bleak stone staircase and
made our way down the long corridor with its vista of whitewashed
wall and dun-coloured doors. Near the further end a low arched passage
branched away from it and led to the chemical laboratory.
This was a lofty chamber, lined and littered with countless bottles.
Broad, low tables were scattered about, which bristled with retorts,
test-tubes, and little Bunsen lamps, with their blue flickering flames.
There was only one student in the room, who was bending over a distant
table absorbed in his work. At the sound of our steps he glanced round
and sprang to his feet with a cry of pleasure. "I've found it! I've
found it," he shouted to my companion, running towards us with a
test-tube in his hand. "I have found a re-agent which is precipitated
by hoemoglobin, [4] and by nothing else." Had he discovered a gold mine,
greater delight could not have shone upon his features.
"Dr. Watson, Mr. Sherlock Holmes," said Stamford, introducing us.
"How are you?" he said cordially, gripping my hand with a strength
for which I should hardly have given him credit. "You have been in
Afghanistan, I perceive."
"How on earth did you know that?" I asked in astonishment.
"Never mind," said he, chuckling to himself. "The question now is about
hoemoglobin. No doubt you see the significance of this discovery of
mine?"
"It is interesting, chemically, no doubt," I answered, "but
practically----"
"Why, man, it is the most practical medico-legal discovery for years.
Don't you see that it gives us an infallible test for blood stains. Come
over here now!" He seized me by the coat-sleeve in his eagerness, and
drew me over to the table at which he had been working. "Let us have
some fresh blood," he said, digging a long bodkin into his finger, and
drawing off the resulting drop of blood in a chemical pipette. "Now, I
add this small quantity of blood to a litre of water. You perceive that
the resulting mixture has the appearance of pure water. The proportion
of blood cannot be more than one in a million. I have no doubt, however,
that we shall be able to obtain the characteristic reaction." As he
spoke, he threw into the vessel a few white crystals, and then added
some drops of a transparent fluid. In an instant the contents assumed a
dull mahogany colour, and a brownish dust was precipitated to the bottom
of the glass jar.
"Ha! ha!" he cried, clapping his hands, and looking as delighted as a
child with a new toy. "What do you think of that?"
"It seems to be a very delicate test," I remarked.
"Beautiful! beautiful! The old Guiacum test was very clumsy and
uncertain. So is the microscopic examination for blood corpuscles. The
latter is valueless if the stains are a few hours old. Now, this appears
to act as well whether the blood is old or new. Had this test been
invented, there are hundreds of men now walking the earth who would long
ago have paid the penalty of their crimes."
"Indeed!" I murmured.
"Criminal cases are continually hinging upon that one point. A man is
suspected of a crime months perhaps after it has been committed. His
linen or clothes are examined, and brownish stains discovered upon them.
Are they blood stains, or mud stains, or rust stains, or fruit stains,
or what are they? That is a question which has puzzled many an expert,
and why? Because there was no reliable test. Now we have the Sherlock
Holmes' test, and there will no longer be any difficulty."
His eyes fairly glittered as he spoke, and he put his hand over his
heart and bowed as if to some applauding crowd conjured up by his
imagination.
"You are to be congratulated," I remarked, considerably surprised at his
enthusiasm.
"There was the case of Von Bischoff at Frankfort last year. He would
certainly have been hung had this test been in existence. Then there was
Mason of Bradford, and the notorious Muller, and Lefevre of Montpellier,
and Samson of New Orleans. I could name a score of cases in which it
would have been decisive."
"You seem to be a walking calendar of crime," said Stamford with a
laugh. "You might start a paper on those lines. Call it the 'Police News
of the Past.'"
"Very interesting reading it might be made, too," remarked Sherlock
Holmes, sticking a small piece of plaster over the prick on his finger.
"I have to be careful," he continued, turning to me with a smile, "for I
dabble with poisons a good deal." He held out his hand as he spoke, and
I noticed that it was all mottled over with similar pieces of plaster,
and discoloured with strong acids.
"We came here on business," said Stamford, sitting down on a high
three-legged stool, and pushing another one in my direction with
his foot. "My friend here wants to take diggings, and as you were
complaining that you could get no one to go halves with you, I thought
that I had better bring you together."
Sherlock Holmes seemed delighted at the idea of sharing his rooms with
me. "I have my eye on a suite in Baker Street," he said, "which would
suit us down to the ground. You don't mind the smell of strong tobacco,
I hope?"
"I always smoke 'ship's' myself," I answered.
"That's good enough. I generally have chemicals about, and occasionally
do experiments. Would that annoy you?"
"By no means."
"Let me see--what are my other shortcomings. I get in the dumps at
times, and don't open my mouth for days on end. You must not think I am
sulky when I do that. Just let me alone, and I'll soon be right. What
have you to confess now? It's just as well for two fellows to know the
worst of one another before they begin to live together."
I laughed at this cross-examination. "I keep a bull pup," I said, "and
I object to rows because my nerves are shaken, and I get up at all sorts
of ungodly hours, and I am extremely lazy. I have another set of vices
when I'm well, but those are the principal ones at present."
"Do you include violin-playing in your category of rows?" he asked,
anxiously.
"It depends on the player," I answered. "A well-played violin is a treat
for the gods--a badly-played one----"
"Oh, that's all right," he cried, with a merry laugh. "I think we may
consider the thing as settled--that is, if the rooms are agreeable to
you."
"When shall we see them?"
"Call for me here at noon to-morrow, and we'll go together and settle
everything," he answered.
"All right--noon exactly," said I, shaking his hand.
We left him working among his chemicals, and we walked together towards
my hotel.
"By the way," I asked suddenly, stopping and turning upon Stamford, "how
the deuce did he know that I had come from Afghanistan?"
My companion smiled an enigmatical smile. "That's just his little
peculiarity," he said. "A good many people have wanted to know how he
finds things out."
"Oh! a mystery is it?" I cried, rubbing my hands. "This is very piquant.
I am much obliged to you for bringing us together. 'The proper study of
mankind is man,' you know."
"You must study him, then," Stamford said, as he bade me good-bye.
"You'll find him a knotty problem, though. I'll wager he learns more
about you than you about him. Good-bye."
"Good-bye," I answered, and strolled on to my hotel, considerably
interested in my new acquaintance.
CHAPTER II. THE SCIENCE OF DEDUCTION.
WE met next day as he had arranged, and inspected the rooms at No. 221B,
[5] Baker Street, of which he had spoken at our meeting. They
consisted of a couple of comfortable bed-rooms and a single large
airy sitting-room, cheerfully furnished, and illuminated by two broad
windows. So desirable in every way were the apartments, and so moderate
did the terms seem when divided between us, that the bargain was
concluded upon the spot, and we at once entered into possession.
That very evening I moved my things round from the hotel, and on the
following morning Sherlock Holmes followed me with several boxes and
portmanteaus. For a day or two we were busily employed in unpacking and
laying out our property to the best advantage. That done, we
gradually began to settle down and to accommodate ourselves to our new
surroundings.
Holmes was certainly not a difficult man to live with. He was quiet
in his ways, and his habits were regular. It was rare for him to be
up after ten at night, and he had invariably breakfasted and gone out
before I rose in the morning. Sometimes he spent his day at the chemical
laboratory, sometimes in the dissecting-rooms, and occasionally in long
walks, which appeared to take him into the lowest portions of the City.
Nothing could exceed his energy when the working fit was upon him; but
now and again a reaction would seize him, and for days on end he would
lie upon the sofa in the sitting-room, hardly uttering a word or moving
a muscle from morning to night. On these occasions I have noticed such
a dreamy, vacant expression in his eyes, that I might have suspected him
of being addicted to the use of some narcotic, had not the temperance
and cleanliness of his whole life forbidden such a notion.
As the weeks went by, my interest in him and my curiosity as to his
aims in life, gradually deepened and increased. His very person and
appearance were such as to strike the attention of the most casual
observer. In height he was rather over six feet, and so excessively
lean that he seemed to be considerably taller. His eyes were sharp and
piercing, save during those intervals of torpor to which I have alluded;
and his thin, hawk-like nose gave his whole expression an air of
alertness and decision. His chin, too, had the prominence and squareness
which mark the man of determination. His hands were invariably
blotted with ink and stained with chemicals, yet he was possessed of
extraordinary delicacy of touch, as I frequently had occasion to observe
when I watched him manipulating his fragile philosophical instruments.
The reader may set me down as a hopeless busybody, when I confess how
much this man stimulated my curiosity, and how often I endeavoured
to break through the reticence which he showed on all that concerned
himself. Before pronouncing judgment, however, be it remembered, how
objectless was my life, and how little there was to engage my attention.
My health forbade me from venturing out unless the weather was
exceptionally genial, and I had no friends who would call upon me and
break the monotony of my daily existence. Under these circumstances, I
eagerly hailed the little mystery which hung around my companion, and
spent much of my time in endeavouring to unravel it.
He was not studying medicine. He had himself, in reply to a question,
confirmed Stamford's opinion upon that point. Neither did he appear to
have pursued any course of reading which might fit him for a degree in
science or any other recognized portal which would give him an entrance
into the learned world. Yet his zeal for certain studies was remarkable,
and within eccentric limits his knowledge was so extraordinarily ample
and minute that his observations have fairly astounded me. Surely no man
would work so hard or attain such precise information unless he had some
definite end in view. Desultory readers are seldom remarkable for the
exactness of their learning. No man burdens his mind with small matters
unless he has some very good reason for doing so.
His ignorance was as remarkable as his knowledge. Of contemporary
literature, philosophy and politics he appeared to know next to nothing.
Upon my quoting Thomas Carlyle, he inquired in the naivest way who he
might be and what he had done. My surprise reached a climax, however,
when I found incidentally that he was ignorant of the Copernican Theory
and of the composition of the Solar System. That any civilized human
being in this nineteenth century should not be aware that the earth
travelled round the sun appeared to be to me such an extraordinary fact
that I could hardly realize it.
"You appear to be astonished," he said, smiling at my expression of
surprise. "Now that I do know it I shall do my best to forget it."
"To forget it!"
"You see," he explained, "I consider that a man's brain originally is
like a little empty attic, and you have to stock it with such furniture
as you choose. A fool takes in all the lumber of every sort that he
comes across, so that the knowledge which might be useful to him gets
crowded out, or at best is jumbled up with a lot of other things so that
he has a difficulty in laying his hands upon it. Now the skilful workman
is very careful indeed as to what he takes into his brain-attic. He will
have nothing but the tools which may help him in doing his work, but of
these he has a large assortment, and all in the most perfect order. It
is a mistake to think that that little room has elastic walls and can
distend to any extent. Depend upon it there comes a time when for every
addition of knowledge you forget something that you knew before. It is
of the highest importance, therefore, not to have useless facts elbowing
out the useful ones."

2
tests/data/sherlock.br Normal file
View File

@@ -0,0 +1,2 @@
n<01><><EFBFBD>-_<>.<2E> <0C><><11><>cM<63><4D><EFBFBD><EFBFBD>Y<EFBFBD>4<08><><EFBFBD><EFBFBD><EFBFBD>Ya <0B>-L<>O(<28>8<EFBFBD>sn^Gwш!,<2C>
KD<EFBFBD><EFBFBD>/7<><37>th<74><1A> <0C><10>]j<02><><EFBFBD><EFBFBD><EFBFBD>E_;d<1E>rF<72>Qs<51>/:DIVB}<7D>T7<54><37>ѵ<04><16>H<EFBFBD>2<EFBFBD><32><EFBFBD>)<29><>M[u<><75><EFBFBD>i<EFBFBD><69><EFBFBD><0F>50ڮ<30>Y6<><36><EFBFBD><EFBFBD><17><><EFBFBD><EFBFBD><07>%<25>ר_<D7A8><5F>U by<62>4<EFBFBD><34>Ϡ<EFBFBD>!&<26>g<><15>#<23>

BIN
tests/data/sherlock.zst Normal file

Binary file not shown.

View File

@@ -1,5 +1,5 @@
use hay::{SHERLOCK, SHERLOCK_CRLF};
use util::{Dir, TestCommand, sort_lines};
use crate::hay::{SHERLOCK, SHERLOCK_CRLF};
use crate::util::{Dir, TestCommand, sort_lines};
// See: https://github.com/BurntSushi/ripgrep/issues/1
rgtest!(f1_sjis, |dir: Dir, mut cmd: TestCommand| {
@@ -72,7 +72,7 @@ rgtest!(f7_stdin, |dir: Dir, mut cmd: TestCommand| {
sherlock:For the Doctor Watsons of this world, as opposed to the Sherlock
sherlock:be, to a very large extent, the result of luck. Sherlock Holmes
";
eqnice!(expected, cmd.arg("-f-").pipe("Sherlock"));
eqnice!(expected, cmd.arg("-f-").pipe(b"Sherlock"));
});
// See: https://github.com/BurntSushi/ripgrep/issues/20
@@ -629,3 +629,101 @@ rgtest!(f993_null_data, |dir: Dir, mut cmd: TestCommand| {
let expected = "foo\x00bar\x00baz\x00";
eqnice!(expected, cmd.stdout());
});
// See: https://github.com/BurntSushi/ripgrep/issues/1078
//
// N.B. There are many more tests in the grep-printer crate.
rgtest!(f1078_max_columns_preview1, |dir: Dir, mut cmd: TestCommand| {
dir.create("sherlock", SHERLOCK);
cmd.args(&[
"-M46", "--max-columns-preview",
"exhibited|dusted|has to have it",
]);
let expected = "\
sherlock:but Doctor Watson has to have it taken out for [... omitted end of long line]
sherlock:and exhibited clearly, with a label attached.
";
eqnice!(expected, cmd.stdout());
});
rgtest!(f1078_max_columns_preview2, |dir: Dir, mut cmd: TestCommand| {
dir.create("sherlock", SHERLOCK);
cmd.args(&[
"-M43", "--max-columns-preview",
// Doing a replacement forces ripgrep to show the number of remaining
// matches. Normally, this happens by default when printing a tty with
// colors.
"-rxxx",
"exhibited|dusted|has to have it",
]);
let expected = "\
sherlock:but Doctor Watson xxx taken out for him and [... 1 more match]
sherlock:and xxx clearly, with a label attached.
";
eqnice!(expected, cmd.stdout());
});
// See: https://github.com/BurntSushi/ripgrep/issues/1138
rgtest!(f1138_no_ignore_dot, |dir: Dir, mut cmd: TestCommand| {
dir.create_dir(".git");
dir.create(".gitignore", "foo");
dir.create(".ignore", "bar");
dir.create(".fzf-ignore", "quux");
dir.create("foo", "");
dir.create("bar", "");
dir.create("quux", "");
cmd.arg("--sort").arg("path").arg("--files");
eqnice!("quux\n", cmd.stdout());
eqnice!("bar\nquux\n", cmd.arg("--no-ignore-dot").stdout());
eqnice!("bar\n", cmd.arg("--ignore-file").arg(".fzf-ignore").stdout());
});
// See: https://github.com/BurntSushi/ripgrep/issues/1155
rgtest!(f1155_auto_hybrid_regex, |dir: Dir, mut cmd: TestCommand| {
// No sense in testing a hybrid regex engine with only one engine!
if !dir.is_pcre2() {
return;
}
dir.create("sherlock", SHERLOCK);
cmd.arg("--no-pcre2").arg("--auto-hybrid-regex").arg(r"(?<=the )Sherlock");
let expected = "\
sherlock:For the Doctor Watsons of this world, as opposed to the Sherlock
";
eqnice!(expected, cmd.stdout());
});
// See: https://github.com/BurntSushi/ripgrep/issues/1207
//
// Tests if without encoding 'none' flag null bytes are consumed by automatic
// encoding detection.
rgtest!(f1207_auto_encoding, |dir: Dir, mut cmd: TestCommand| {
dir.create_bytes(
"foo",
b"\xFF\xFE\x00\x62"
);
cmd.arg("-a").arg("\\x00").arg("foo");
cmd.assert_exit_code(1);
});
// See: https://github.com/BurntSushi/ripgrep/issues/1207
//
// Tests if encoding 'none' flag does treat file as raw bytes
rgtest!(f1207_ignore_encoding, |dir: Dir, mut cmd: TestCommand| {
// PCRE2 chokes on this test because it can't search invalid non-UTF-8
// and the point of this test is to search raw UTF-16.
if dir.is_pcre2() {
return;
}
dir.create_bytes(
"foo",
b"\xFF\xFE\x00\x62"
);
cmd.arg("--encoding").arg("none").arg("-a").arg("\\x00").arg("foo");
eqnice!("\u{FFFD}\u{FFFD}\x00b\n", cmd.stdout());
});

View File

@@ -1,9 +1,10 @@
use std::time;
use serde_derive::Deserialize;
use serde_json as json;
use hay::{SHERLOCK, SHERLOCK_CRLF};
use util::{Dir, TestCommand};
use crate::hay::{SHERLOCK, SHERLOCK_CRLF};
use crate::util::{Dir, TestCommand};
#[derive(Clone, Debug, Deserialize, PartialEq, Eq)]
#[serde(tag = "type", content = "data")]
@@ -152,7 +153,10 @@ rgtest!(basic, |dir: Dir, mut cmd: TestCommand| {
msgs[1].unwrap_context(),
Context {
path: Some(Data::text("sherlock")),
lines: Data::text("Holmeses, success in the province of detective work must always\n"),
lines: Data::text(
"Holmeses, success in the province of \
detective work must always\n",
),
line_number: Some(2),
absolute_offset: 65,
submatches: vec![],
@@ -162,7 +166,10 @@ rgtest!(basic, |dir: Dir, mut cmd: TestCommand| {
msgs[2].unwrap_match(),
Match {
path: Some(Data::text("sherlock")),
lines: Data::text("be, to a very large extent, the result of luck. Sherlock Holmes\n"),
lines: Data::text(
"be, to a very large extent, the result of luck. \
Sherlock Holmes\n",
),
line_number: Some(3),
absolute_offset: 129,
submatches: vec![
@@ -211,7 +218,9 @@ rgtest!(notutf8, |dir: Dir, mut cmd: TestCommand| {
let contents = &b"quux\xFFbaz"[..];
// APFS does not support creating files with invalid UTF-8 bytes, so just
// skip the test if we can't create our file.
// skip the test if we can't create our file. Presumably we don't need this
// check if we're already skipping it on macOS, but maybe other file
// systems won't like this test either?
if !dir.try_create_bytes(OsStr::from_bytes(name), contents).is_ok() {
return;
}
@@ -241,6 +250,49 @@ rgtest!(notutf8, |dir: Dir, mut cmd: TestCommand| {
);
});
rgtest!(notutf8_file, |dir: Dir, mut cmd: TestCommand| {
use std::ffi::OsStr;
// This test does not work with PCRE2 because PCRE2 does not support the
// `u` flag.
if dir.is_pcre2() {
return;
}
let name = "foo";
let contents = &b"quux\xFFbaz"[..];
// APFS does not support creating files with invalid UTF-8 bytes, so just
// skip the test if we can't create our file.
if !dir.try_create_bytes(OsStr::new(name), contents).is_ok() {
return;
}
cmd.arg("--json").arg(r"(?-u)\xFF");
let msgs = json_decode(&cmd.stdout());
assert_eq!(
msgs[0].unwrap_begin(),
Begin { path: Some(Data::text("foo")) }
);
assert_eq!(
msgs[1].unwrap_match(),
Match {
path: Some(Data::text("foo")),
lines: Data::bytes("cXV1eP9iYXo="),
line_number: Some(1),
absolute_offset: 0,
submatches: vec![
SubMatch {
m: Data::bytes("/w=="),
start: 4,
end: 5,
},
],
}
);
});
// See: https://github.com/BurntSushi/ripgrep/issues/416
//
// This test in particular checks that our match does _not_ include the `\r`
@@ -261,3 +313,52 @@ rgtest!(crlf, |dir: Dir, mut cmd: TestCommand| {
},
);
});
// See: https://github.com/BurntSushi/ripgrep/issues/1095
//
// This test checks that we don't drop the \r\n in a matching line when --crlf
// mode is enabled.
rgtest!(r1095_missing_crlf, |dir: Dir, mut cmd: TestCommand| {
dir.create("foo", "test\r\n");
// Check without --crlf flag.
let msgs = json_decode(&cmd.arg("--json").arg("test").stdout());
assert_eq!(msgs.len(), 4);
assert_eq!(msgs[1].unwrap_match().lines, Data::text("test\r\n"));
// Now check with --crlf flag.
let msgs = json_decode(&cmd.arg("--crlf").stdout());
assert_eq!(msgs.len(), 4);
assert_eq!(msgs[1].unwrap_match().lines, Data::text("test\r\n"));
});
// See: https://github.com/BurntSushi/ripgrep/issues/1095
//
// This test checks that we don't return empty submatches when matching a `\n`
// in CRLF mode.
rgtest!(r1095_crlf_empty_match, |dir: Dir, mut cmd: TestCommand| {
dir.create("foo", "test\r\n\n");
// Check without --crlf flag.
let msgs = json_decode(&cmd.arg("-U").arg("--json").arg("\n").stdout());
assert_eq!(msgs.len(), 5);
let m = msgs[1].unwrap_match();
assert_eq!(m.lines, Data::text("test\r\n"));
assert_eq!(m.submatches[0].m, Data::text("\n"));
let m = msgs[2].unwrap_match();
assert_eq!(m.lines, Data::text("\n"));
assert_eq!(m.submatches[0].m, Data::text("\n"));
// Now check with --crlf flag.
let msgs = json_decode(&cmd.arg("--crlf").stdout());
let m = msgs[1].unwrap_match();
assert_eq!(m.lines, Data::text("test\r\n"));
assert_eq!(m.submatches[0].m, Data::text("\n"));
let m = msgs[2].unwrap_match();
assert_eq!(m.lines, Data::text("\n"));
assert_eq!(m.submatches[0].m, Data::text("\n"));
});

View File

@@ -3,11 +3,11 @@ macro_rules! rgtest {
($name:ident, $fun:expr) => {
#[test]
fn $name() {
let (dir, cmd) = ::util::setup(stringify!($name));
let (dir, cmd) = crate::util::setup(stringify!($name));
$fun(dir, cmd);
if cfg!(feature = "pcre2") {
let (dir, cmd) = ::util::setup_pcre2(stringify!($name));
let (dir, cmd) = crate::util::setup_pcre2(stringify!($name));
$fun(dir, cmd);
}
}

View File

@@ -1,5 +1,5 @@
use hay::SHERLOCK;
use util::{Dir, TestCommand, cmd_exists, sort_lines};
use crate::hay::SHERLOCK;
use crate::util::{Dir, TestCommand, cmd_exists, sort_lines};
// This file contains "miscellaneous" tests that were either written before
// features were tracked more explicitly, or were simply written without
@@ -752,12 +752,11 @@ rgtest!(unrestricted2, |dir: Dir, mut cmd: TestCommand| {
rgtest!(unrestricted3, |dir: Dir, mut cmd: TestCommand| {
dir.create("sherlock", SHERLOCK);
dir.create("file", "foo\x00bar\nfoo\x00baz\n");
dir.create("hay", "foo\x00bar\nfoo\x00baz\n");
cmd.arg("-uuu").arg("foo");
let expected = "\
file:foo\x00bar
file:foo\x00baz
Binary file hay matches (found \"\\u{0}\" byte around offset 3)
";
eqnice!(expected, cmd.stdout());
});
@@ -816,6 +815,24 @@ be, to a very large extent, the result of luck. Sherlock Holmes
eqnice!(expected, cmd.stdout());
});
rgtest!(preprocessing_glob, |dir: Dir, mut cmd: TestCommand| {
if !cmd_exists("xzcat") {
return;
}
dir.create("sherlock", SHERLOCK);
dir.create_bytes("sherlock.xz", include_bytes!("./data/sherlock.xz"));
cmd.args(&["--pre", "xzcat", "--pre-glob", "*.xz", "Sherlock"]);
let expected = "\
sherlock.xz:For the Doctor Watsons of this world, as opposed to the Sherlock
sherlock.xz:be, to a very large extent, the result of luck. Sherlock Holmes
sherlock:For the Doctor Watsons of this world, as opposed to the Sherlock
sherlock:be, to a very large extent, the result of luck. Sherlock Holmes
";
eqnice!(sort_lines(expected), sort_lines(&cmd.stdout()));
});
rgtest!(compressed_gzip, |dir: Dir, mut cmd: TestCommand| {
if !cmd_exists("gzip") {
return;
@@ -891,6 +908,36 @@ be, to a very large extent, the result of luck. Sherlock Holmes
eqnice!(expected, cmd.stdout());
});
rgtest!(compressed_brotli, |dir: Dir, mut cmd: TestCommand| {
if !cmd_exists("brotli") {
return;
}
dir.create_bytes("sherlock.br", include_bytes!("./data/sherlock.br"));
cmd.arg("-z").arg("Sherlock").arg("sherlock.br");
let expected = "\
For the Doctor Watsons of this world, as opposed to the Sherlock
be, to a very large extent, the result of luck. Sherlock Holmes
";
eqnice!(expected, cmd.stdout());
});
rgtest!(compressed_zstd, |dir: Dir, mut cmd: TestCommand| {
if !cmd_exists("zstd") {
return;
}
dir.create_bytes("sherlock.zst", include_bytes!("./data/sherlock.zst"));
cmd.arg("-z").arg("Sherlock").arg("sherlock.zst");
let expected = "\
For the Doctor Watsons of this world, as opposed to the Sherlock
be, to a very large extent, the result of luck. Sherlock Holmes
";
eqnice!(expected, cmd.stdout());
});
rgtest!(compressed_failing_gzip, |dir: Dir, mut cmd: TestCommand| {
if !cmd_exists("gzip") {
return;
@@ -902,10 +949,35 @@ rgtest!(compressed_failing_gzip, |dir: Dir, mut cmd: TestCommand| {
cmd.assert_non_empty_stderr();
});
rgtest!(binary_nosearch, |dir: Dir, mut cmd: TestCommand| {
rgtest!(binary_convert, |dir: Dir, mut cmd: TestCommand| {
dir.create("file", "foo\x00bar\nfoo\x00baz\n");
cmd.arg("foo").arg("file");
cmd.arg("--no-mmap").arg("foo").arg("file");
let expected = "\
Binary file matches (found \"\\u{0}\" byte around offset 3)
";
eqnice!(expected, cmd.stdout());
});
rgtest!(binary_convert_mmap, |dir: Dir, mut cmd: TestCommand| {
dir.create("file", "foo\x00bar\nfoo\x00baz\n");
cmd.arg("--mmap").arg("foo").arg("file");
let expected = "\
Binary file matches (found \"\\u{0}\" byte around offset 3)
";
eqnice!(expected, cmd.stdout());
});
rgtest!(binary_quit, |dir: Dir, mut cmd: TestCommand| {
dir.create("file", "foo\x00bar\nfoo\x00baz\n");
cmd.arg("--no-mmap").arg("foo").arg("-gfile");
cmd.assert_err();
});
rgtest!(binary_quit_mmap, |dir: Dir, mut cmd: TestCommand| {
dir.create("file", "foo\x00bar\nfoo\x00baz\n");
cmd.arg("--mmap").arg("foo").arg("-gfile");
cmd.assert_err();
});

View File

@@ -1,5 +1,5 @@
use hay::SHERLOCK;
use util::{Dir, TestCommand};
use crate::hay::SHERLOCK;
use crate::util::{Dir, TestCommand};
// This tests that multiline matches that span multiple lines, but where
// multiple matches may begin and end on the same line work correctly.
@@ -88,7 +88,7 @@ rgtest!(stdin, |_: Dir, mut cmd: TestCommand| {
1:For the Doctor Watsons of this world, as opposed to the Sherlock
2:Holmeses, success in the province of detective work must always
";
eqnice!(expected, cmd.pipe(SHERLOCK));
eqnice!(expected, cmd.pipe(SHERLOCK.as_bytes()));
});
// Test that multiline search and contextual matches work.

View File

@@ -1,5 +1,5 @@
use hay::SHERLOCK;
use util::{Dir, TestCommand, sort_lines};
use crate::hay::SHERLOCK;
use crate::util::{Dir, TestCommand, sort_lines};
// See: https://github.com/BurntSushi/ripgrep/issues/16
rgtest!(r16, |dir: Dir, mut cmd: TestCommand| {
@@ -562,3 +562,157 @@ rgtest!(r900, |dir: Dir, mut cmd: TestCommand| {
cmd.arg("-fpat").arg("sherlock").assert_err();
});
// See: https://github.com/BurntSushi/ripgrep/issues/1064
rgtest!(r1064, |dir: Dir, mut cmd: TestCommand| {
dir.create("input", "abc");
eqnice!("input:abc\n", cmd.arg("a(.*c)").stdout());
});
// See: https://github.com/BurntSushi/ripgrep/issues/1174
rgtest!(r1098, |dir: Dir, mut cmd: TestCommand| {
dir.create_dir(".git");
dir.create(".gitignore", "a**b");
dir.create("afoob", "test");
cmd.arg("test").assert_err();
});
// See: https://github.com/BurntSushi/ripgrep/issues/1130
rgtest!(r1130, |dir: Dir, mut cmd: TestCommand| {
dir.create("foo", "test");
eqnice!(
"foo\n",
cmd.arg("--files-with-matches").arg("test").arg("foo").stdout()
);
let mut cmd = dir.command();
eqnice!(
"foo\n",
cmd.arg("--files-without-match").arg("nada").arg("foo").stdout()
);
});
// See: https://github.com/BurntSushi/ripgrep/issues/1159
rgtest!(r1159_invalid_flag, |_: Dir, mut cmd: TestCommand| {
cmd.arg("--wat").assert_exit_code(2);
});
// See: https://github.com/BurntSushi/ripgrep/issues/1159
rgtest!(r1159_exit_status, |dir: Dir, _: TestCommand| {
dir.create("foo", "test");
// search with a match gets 0 exit status.
let mut cmd = dir.command();
cmd.arg("test").assert_exit_code(0);
// search with --quiet and a match gets 0 exit status.
let mut cmd = dir.command();
cmd.arg("-q").arg("test").assert_exit_code(0);
// search with a match and an error gets 2 exit status.
let mut cmd = dir.command();
cmd.arg("test").arg("no-file").assert_exit_code(2);
// search with a match in --quiet mode and an error gets 0 exit status.
let mut cmd = dir.command();
cmd.arg("-q").arg("test").arg("foo").arg("no-file").assert_exit_code(0);
// search with no match gets 1 exit status.
let mut cmd = dir.command();
cmd.arg("nada").assert_exit_code(1);
// search with --quiet and no match gets 1 exit status.
let mut cmd = dir.command();
cmd.arg("-q").arg("nada").assert_exit_code(1);
// search with no match and an error gets 2 exit status.
let mut cmd = dir.command();
cmd.arg("nada").arg("no-file").assert_exit_code(2);
// search with no match in --quiet mode and an error gets 2 exit status.
let mut cmd = dir.command();
cmd.arg("-q").arg("nada").arg("foo").arg("no-file").assert_exit_code(2);
});
// See: https://github.com/BurntSushi/ripgrep/issues/1163
rgtest!(r1163, |dir: Dir, mut cmd: TestCommand| {
dir.create("bom.txt", "\u{FEFF}test123\ntest123");
eqnice!(
"bom.txt:test123\nbom.txt:test123\n",
cmd.arg("^test123").stdout()
);
});
// See: https://github.com/BurntSushi/ripgrep/issues/1164
rgtest!(r1164, |dir: Dir, mut cmd: TestCommand| {
dir.create_dir(".git");
dir.create(".gitignore", "myfile");
dir.create("MYFILE", "test");
cmd.arg("--ignore-file-case-insensitive").arg("test").assert_err();
eqnice!(
"MYFILE:test\n",
cmd.arg("--no-ignore-file-case-insensitive").stdout()
);
});
// See: https://github.com/BurntSushi/ripgrep/issues/1173
rgtest!(r1173, |dir: Dir, mut cmd: TestCommand| {
dir.create_dir(".git");
dir.create(".gitignore", "**");
dir.create("foo", "test");
cmd.arg("test").assert_err();
});
// See: https://github.com/BurntSushi/ripgrep/issues/1174
rgtest!(r1174, |dir: Dir, mut cmd: TestCommand| {
dir.create_dir(".git");
dir.create(".gitignore", "**/**/*");
dir.create_dir("a");
dir.create("a/foo", "test");
cmd.arg("test").assert_err();
});
// See: https://github.com/BurntSushi/ripgrep/issues/1176
rgtest!(r1176_literal_file, |dir: Dir, mut cmd: TestCommand| {
dir.create("patterns", "foo(bar\n");
dir.create("test", "foo(bar");
eqnice!(
"foo(bar\n",
cmd.arg("-F").arg("-f").arg("patterns").arg("test").stdout()
);
});
// See: https://github.com/BurntSushi/ripgrep/issues/1176
rgtest!(r1176_line_regex, |dir: Dir, mut cmd: TestCommand| {
dir.create("patterns", "foo\n");
dir.create("test", "foobar\nfoo\nbarfoo\n");
eqnice!(
"foo\n",
cmd.arg("-x").arg("-f").arg("patterns").arg("test").stdout()
);
});
// See: https://github.com/BurntSushi/ripgrep/issues/1203
rgtest!(r1203_reverse_suffix_literal, |dir: Dir, _: TestCommand| {
dir.create("test", "153.230000\n");
let mut cmd = dir.command();
eqnice!("153.230000\n", cmd.arg(r"\d\d\d00").arg("test").stdout());
let mut cmd = dir.command();
eqnice!("153.230000\n", cmd.arg(r"\d\d\d000").arg("test").stdout());
});
// See: https://github.com/BurntSushi/ripgrep/issues/1259
rgtest!(r1259_drop_last_byte_nonl, |dir: Dir, mut cmd: TestCommand| {
dir.create("patterns-nonl", "[foo]");
dir.create("patterns-nl", "[foo]\n");
dir.create("test", "fz");
eqnice!("fz\n", cmd.arg("-f").arg("patterns-nonl").arg("test").stdout());
cmd = dir.command();
eqnice!("fz\n", cmd.arg("-f").arg("patterns-nl").arg("test").stdout());
});

View File

@@ -1,8 +1,3 @@
extern crate serde;
#[macro_use]
extern crate serde_derive;
extern crate serde_json;
// Macros useful for testing.
#[macro_use]
mod macros;
@@ -12,6 +7,8 @@ mod hay;
// Utilities for making tests nicer to read and easier to write.
mod util;
// Tests for ripgrep's handling of binary files.
mod binary;
// Tests related to most features in ripgrep. If you're adding something new
// to ripgrep, tests should probably go in here.
mod feature;

Some files were not shown because too many files have changed in this diff Show More