Andrew Gallant 082245dadb cli: replace clap with lexopt and supporting code
ripgrep began its life with docopt for argument parsing. Then it moved
to Clap and stayed there for a number of years. Clap has served ripgrep
well, and it probably could continue to serve ripgrep well, but I ended
up deciding to move off of it.

Why?

The first time I had the thought of moving off of Clap was during the
2->3->4 transition. I thought the 3.x and 4.x releases were great, but
for me, it ended up moving a little too quickly. Since the release of
4.x was telegraphed around when 3.x came out, I decided to just hold off
and wait to migrate to 4.x instead of doing a 3.x migration followed
shortly by another 4.x migration. Of course, I never ended up doing the
migration at all: there just wasn't a compelling reason for me to
upgrade. While I never investigated it, I
saw an upgrade as a non-trivial amount of work in part because I didn't
encapsulate the usage of Clap enough.

The above is just what got me started thinking about it. It wasn't
enough to get me to move off of it on its own. What ended up pushing me
over the edge was a combination of factors:

* As mentioned above, I didn't want to run on the migration treadmill.
This has proven not to be much of an issue, but at the time of the
2->3->4 releases, I didn't know how long Clap 4.x would be out before a
5.x came out.
* The release of lexopt[1] caught my eye. IMO, that crate demonstrates
exactly how something new can arrive on the scene and just thoroughly
solve a problem minimalistically. It has the docs, the reasoning, the
simple API, the tests, and good judgment. It gets all the weird corner
cases right that Clap also gets right (and that is part of why I was
originally attracted to Clap). A sketch of the style of parse loop it
encourages appears after this list.
* I have an overall desire to reduce the size of my dependency tree. In
part because a smaller dependency tree tends to correlate with better
compile times, but also in part because it reduces my reliance on, and
trust in, others. It lets me be the "master" of ripgrep's destiny by
reducing
the amount of behavior that is the result of someone else's decision
(whether good or bad).
* I perceived that Clap solves a more general problem than what I
actually need solved. Despite the vast number of flags that ripgrep has,
its requirements are actually pretty simple. We just need simple
switches and flags that support one value. No multi-value flags. No
sub-commands. Nor much of the other functionality that Clap has that
makes it so flexible for so many different use cases. (I'm being
hand-wavy on the last point.)
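To make that concrete, here is a minimal sketch (not ripgrep's actual
parser) of the kind of loop lexopt encourages, as of lexopt 0.3. The
flag names are just illustrative:

    use lexopt::prelude::*;

    fn parse() -> Result<(), lexopt::Error> {
        let mut parser = lexopt::Parser::from_env();
        while let Some(arg) = parser.next()? {
            match arg {
                // A flag with a single value, e.g., `-t toml`.
                Short('t') | Long("type") => {
                    let _ftype = parser.value()?;
                }
                // A simple switch.
                Long("no-ignore") => {}
                // A positional argument, e.g., the pattern or a path.
                Value(_positional) => {}
                // Everything else is an error.
                _ => return Err(arg.unexpected()),
            }
        }
        Ok(())
    }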

With all that said, perhaps most importantly, the future of ripgrep
possibly demands a more flexible CLI argument parser. In today's world,
I would really like, for example, flags like `--type` and `--type-not`
to be able to accumulate their repeated values into a single sequence
while respecting the order they appear on the CLI. For example, prior
to this migration, `rg regex-automata -Tlock -ttoml` would not return
results in `Cargo.lock` in this repository because the `-Tlock` always
took priority even though `-ttoml` appeared after it. But with this
migration, `-ttoml` now correctly overrides `-Tlock`. We would like to
do similar things for `-g/--glob` and `--iglob` and potentially even
now introduce a `-G/--glob-not` flag instead of requiring users to use
`!` to negate a glob. (Which I had done originally to work around this
problem.) And some day, I'd like to add some kind of boolean matching to
ripgrep, perhaps similar to how `git grep` does it. (Although I haven't
thought too carefully about a design yet.) I perceive that all of this
would be difficult to implement correctly in Clap.

I believe the ordering behavior is possible to implement correctly in
Clap 2.x, although it is awkward to do so. I have not looked closely
enough at the Clap 4.x API to know whether it's still possible there. In
any case, these were enough reasons to move off of Clap and own more of
the argument parsing process myself.
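To make the `-Tlock -ttoml` example above concrete, here is one
hypothetical way to accumulate type flags in CLI order (the names are
illustrative, not ripgrep's actual internals):

    /// A single `-t/--type` or `-T/--type-not` occurrence.
    enum TypeSelection {
        Select(String), // -t/--type
        Negate(String), // -T/--type-not
    }

    /// Push each occurrence as it is parsed, preserving CLI order.
    /// `rg regex-automata -Tlock -ttoml` yields
    /// `[Negate("lock"), Select("toml")]`, and applying the selections
    /// in order lets the later `-ttoml` re-enable TOML files.
    struct ParsedArgs {
        type_selections: Vec<TypeSelection>,
    }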

This did require a few things:

* I had to write my own logic for how arguments are combined into one
single state object. Of course, I wanted this. This was part of the
upside. But it's still code I didn't have to write for Clap.
* I had to write my own shell completion generator.
* I had to write my own `-h/--help` output generator.
* I also had to write my own man page generator. Well, I had to do this
with Clap 2.x too, although my understanding is that Clap 4.x supports
this. With that said, without having tried it, my guess is that I
probably wouldn't have liked the output it generated because I
ultimately had to write most of the roff by hand myself to get the man
page I wanted. (This also had the benefit of dropping the build
dependency on asciidoc/asciidoctor.)

While this is definitely a fair bit of extra work, it only cost me a
couple of days overall. IMO, that's a good trade-off given that this code is
unlikely to change again in any substantial way. And it should also
allow for more flexible semantics going forward.

Fixes #884, Fixes #1648, Fixes #1701, Fixes #1814, Fixes #1966

[1]: https://docs.rs/lexopt/0.3.0/lexopt/index.html
2023-11-20 23:51:53 -05:00


use std::{borrow::Cow, cell::OnceCell, fmt, io, path::Path, time};

use {
    bstr::ByteVec,
    grep_matcher::{Captures, LineTerminator, Match, Matcher},
    grep_searcher::{
        LineIter, Searcher, SinkContext, SinkContextKind, SinkError, SinkMatch,
    },
};

#[cfg(feature = "serde")]
use serde::{Serialize, Serializer};

use crate::{hyperlink::HyperlinkPath, MAX_LOOK_AHEAD};
/// A type for handling replacements while amortizing allocation.
pub(crate) struct Replacer<M: Matcher> {
space: Option<Space<M>>,
}
struct Space<M: Matcher> {
/// The place to store capture locations.
caps: M::Captures,
/// The place to write a replacement to.
dst: Vec<u8>,
/// The place to store match offsets in terms of `dst`.
matches: Vec<Match>,
}
impl<M: Matcher> fmt::Debug for Replacer<M> {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
let (dst, matches) = self.replacement().unwrap_or((&[], &[]));
f.debug_struct("Replacer")
.field("dst", &dst)
.field("matches", &matches)
.finish()
}
}
impl<M: Matcher> Replacer<M> {
/// Create a new replacer for use with a particular matcher.
///
/// This constructor does not allocate. Instead, space for dealing with
/// replacements is allocated lazily only when needed.
pub(crate) fn new() -> Replacer<M> {
Replacer { space: None }
}
/// Executes a replacement on the given haystack string by replacing all
/// matches with the given replacement. To access the result of the
/// replacement, use the `replacement` method.
///
/// This can fail if the underlying matcher reports an error.
pub(crate) fn replace_all<'a>(
&'a mut self,
searcher: &Searcher,
matcher: &M,
mut haystack: &[u8],
range: std::ops::Range<usize>,
replacement: &[u8],
) -> io::Result<()> {
// See the giant comment in 'find_iter_at_in_context' below for why we
// do this dance.
let is_multi_line = searcher.multi_line_with_matcher(&matcher);
if is_multi_line {
if haystack[range.end..].len() >= MAX_LOOK_AHEAD {
haystack = &haystack[..range.end + MAX_LOOK_AHEAD];
}
} else {
// When searching a single line, we should remove the line
// terminator. Otherwise, it's possible for the regex (via
// look-around) to observe the line terminator and not match
// because of it.
let mut m = Match::new(0, range.end);
trim_line_terminator(searcher, haystack, &mut m);
haystack = &haystack[..m.end()];
}
{
let &mut Space { ref mut dst, ref mut caps, ref mut matches } =
self.allocate(matcher)?;
dst.clear();
matches.clear();
replace_with_captures_in_context(
matcher,
haystack,
range.clone(),
caps,
dst,
|caps, dst| {
let start = dst.len();
caps.interpolate(
|name| matcher.capture_index(name),
haystack,
replacement,
dst,
);
let end = dst.len();
matches.push(Match::new(start, end));
true
},
)
.map_err(io::Error::error_message)?;
}
Ok(())
}
/// Return the result of the prior replacement and the match offsets for
/// all replacement occurrences within the returned replacement buffer.
///
/// If no replacement has occurred then `None` is returned.
pub(crate) fn replacement<'a>(
&'a self,
) -> Option<(&'a [u8], &'a [Match])> {
match self.space {
None => None,
Some(ref space) => {
if space.matches.is_empty() {
None
} else {
Some((&space.dst, &space.matches))
}
}
}
}
/// Clear space used for performing a replacement.
///
/// Subsequent calls to `replacement` after calling `clear` (but before
/// executing another replacement) will always return `None`.
pub(crate) fn clear(&mut self) {
if let Some(ref mut space) = self.space {
space.dst.clear();
space.matches.clear();
}
}
/// Allocate space for replacements when used with the given matcher and
/// return a mutable reference to that space.
///
/// This can fail if allocating space for capture locations from the given
/// matcher fails.
fn allocate(&mut self, matcher: &M) -> io::Result<&mut Space<M>> {
if self.space.is_none() {
let caps =
matcher.new_captures().map_err(io::Error::error_message)?;
self.space = Some(Space { caps, dst: vec![], matches: vec![] });
}
Ok(self.space.as_mut().unwrap())
}
}
/// A simple layer of abstraction over either a match or a contextual line
/// reported by the searcher.
///
/// In particular, this provides an API that unions the `SinkMatch` and
/// `SinkContext` types while also exposing a list of all individual match
/// locations.
///
/// While this serves as a convenient mechanism to abstract over `SinkMatch`
/// and `SinkContext`, this also provides a way to abstract over replacements.
/// Namely, after a replacement, a `Sunk` value can be constructed using the
/// results of the replacement instead of the bytes reported directly by the
/// searcher.
#[derive(Debug)]
pub(crate) struct Sunk<'a> {
bytes: &'a [u8],
absolute_byte_offset: u64,
line_number: Option<u64>,
context_kind: Option<&'a SinkContextKind>,
matches: &'a [Match],
original_matches: &'a [Match],
}
impl<'a> Sunk<'a> {
#[inline]
pub(crate) fn empty() -> Sunk<'static> {
Sunk {
bytes: &[],
absolute_byte_offset: 0,
line_number: None,
context_kind: None,
matches: &[],
original_matches: &[],
}
}
#[inline]
pub(crate) fn from_sink_match(
sunk: &'a SinkMatch<'a>,
original_matches: &'a [Match],
replacement: Option<(&'a [u8], &'a [Match])>,
) -> Sunk<'a> {
let (bytes, matches) =
replacement.unwrap_or_else(|| (sunk.bytes(), original_matches));
Sunk {
bytes,
absolute_byte_offset: sunk.absolute_byte_offset(),
line_number: sunk.line_number(),
context_kind: None,
matches,
original_matches,
}
}
#[inline]
pub(crate) fn from_sink_context(
sunk: &'a SinkContext<'a>,
original_matches: &'a [Match],
replacement: Option<(&'a [u8], &'a [Match])>,
) -> Sunk<'a> {
let (bytes, matches) =
replacement.unwrap_or_else(|| (sunk.bytes(), original_matches));
Sunk {
bytes,
absolute_byte_offset: sunk.absolute_byte_offset(),
line_number: sunk.line_number(),
context_kind: Some(sunk.kind()),
matches,
original_matches,
}
}
#[inline]
pub(crate) fn context_kind(&self) -> Option<&'a SinkContextKind> {
self.context_kind
}
#[inline]
pub(crate) fn bytes(&self) -> &'a [u8] {
self.bytes
}
#[inline]
pub(crate) fn matches(&self) -> &'a [Match] {
self.matches
}
#[inline]
pub(crate) fn original_matches(&self) -> &'a [Match] {
self.original_matches
}
#[inline]
pub(crate) fn lines(&self, line_term: u8) -> LineIter<'a> {
LineIter::new(line_term, self.bytes())
}
#[inline]
pub(crate) fn absolute_byte_offset(&self) -> u64 {
self.absolute_byte_offset
}
#[inline]
pub(crate) fn line_number(&self) -> Option<u64> {
self.line_number
}
}
/// A simple encapsulation of a file path used by a printer.
///
/// This represents any transforms that we might want to perform on the path,
/// such as converting it to valid UTF-8 and/or replacing its separator with
/// something else. This allows us to amortize work if we are printing the
/// file path for every match.
///
/// In the common case, no transformation is needed, which lets us avoid
/// the allocation. Typically, only Windows requires a transform, since
/// it's fraught to access the raw bytes of a path directly and first need
/// to lossily convert to UTF-8. Windows is also typically where the path
/// separator replacement is used, e.g., in cygwin environments to use `/`
/// instead of `\`.
///
/// Users of this type are expected to construct it from a normal `Path`
/// found in the standard library. It can then be written to any `io::Write`
/// implementation using the `as_bytes` method. This achieves platform
/// portability with a small cost: on Windows, paths that are not valid UTF-16
/// will not roundtrip correctly.
#[derive(Clone, Debug)]
pub(crate) struct PrinterPath<'a> {
// On Unix, we can re-materialize a `Path` from our `Cow<'a, [u8]>` with
// zero cost, so there's no point in storing it. At time of writing,
// OsStr::as_os_str_bytes (and its corresponding constructor) are not
// stable yet. Those would let us achieve the same end portably. (As long
// as we keep our UTF-8 requirement on Windows.)
#[cfg(not(unix))]
path: &'a Path,
bytes: Cow<'a, [u8]>,
hyperlink: OnceCell<Option<HyperlinkPath>>,
}
impl<'a> PrinterPath<'a> {
/// Create a new path suitable for printing.
pub(crate) fn new(path: &'a Path) -> PrinterPath<'a> {
PrinterPath {
#[cfg(not(unix))]
path,
// N.B. This is zero-cost on Unix and requires at least a UTF-8
// check on Windows. This doesn't allocate on Windows unless the
// path is invalid UTF-8 (which is exceptionally rare).
bytes: Vec::from_path_lossy(path),
hyperlink: OnceCell::new(),
}
}
/// Set the separator on this path.
///
/// When set, `PrinterPath::as_bytes` will return the path provided but
/// with its separator replaced with the one given.
pub(crate) fn with_separator(
mut self,
sep: Option<u8>,
) -> PrinterPath<'a> {
/// Replace the path separator in this path with the given separator
/// and do it in place. On Windows, both `/` and `\` are treated as
/// path separators that are both replaced by `new_sep`. In all other
/// environments, only `/` is treated as a path separator.
fn replace_separator(bytes: &[u8], sep: u8) -> Vec<u8> {
let mut bytes = bytes.to_vec();
for b in bytes.iter_mut() {
if *b == b'/' || (cfg!(windows) && *b == b'\\') {
*b = sep;
}
}
bytes
}
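// If no separator replacement was requested, leave the path as-is.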
let Some(sep) = sep else { return self };
self.bytes = Cow::Owned(replace_separator(self.as_bytes(), sep));
self
}
/// Return the raw bytes for this path.
pub(crate) fn as_bytes(&self) -> &[u8] {
&self.bytes
}
/// Return this path as a hyperlink.
///
/// Note that a hyperlink may not be able to be created from a path.
/// Namely, computing the hyperlink may require touching the file system
/// (e.g., for path canonicalization) and that can fail. This failure is
/// silent but is logged.
pub(crate) fn as_hyperlink(&self) -> Option<&HyperlinkPath> {
self.hyperlink
.get_or_init(|| HyperlinkPath::from_path(self.as_path()))
.as_ref()
}
/// Return this path as an actual `Path` type.
fn as_path(&self) -> &Path {
#[cfg(unix)]
fn imp<'p>(p: &'p PrinterPath<'_>) -> &'p Path {
use std::{ffi::OsStr, os::unix::ffi::OsStrExt};
Path::new(OsStr::from_bytes(p.as_bytes()))
}
#[cfg(not(unix))]
fn imp<'p>(p: &'p PrinterPath<'_>) -> &'p Path {
p.path
}
imp(self)
}
}
/// A type that provides "nicer" Display and Serialize impls for
/// std::time::Duration. The serialization format should actually be compatible
/// with the Deserialize impl for std::time::Duration, since this type only
/// adds new fields.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub(crate) struct NiceDuration(pub time::Duration);
impl fmt::Display for NiceDuration {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "{:0.6}s", self.fractional_seconds())
}
}
impl NiceDuration {
/// Returns the number of seconds in this duration in fractional form.
/// The number to the left of the decimal point is the number of whole
/// seconds, and the number to the right is the fractional part of a
/// second, computed to nanosecond precision.
fn fractional_seconds(&self) -> f64 {
let fractional = (self.0.subsec_nanos() as f64) / 1_000_000_000.0;
self.0.as_secs() as f64 + fractional
}
}
#[cfg(feature = "serde")]
impl Serialize for NiceDuration {
fn serialize<S: Serializer>(&self, ser: S) -> Result<S::Ok, S::Error> {
use serde::ser::SerializeStruct;
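// Three fields: the secs/nanos pair matches the standard serde
// encoding of `Duration`, plus an extra human-readable rendering.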
let mut state = ser.serialize_struct("Duration", 3)?;
state.serialize_field("secs", &self.0.as_secs())?;
state.serialize_field("nanos", &self.0.subsec_nanos())?;
state.serialize_field("human", &format!("{}", self))?;
state.end()
}
}
/// A simple formatter for converting `u64` values to ASCII byte strings.
///
/// This avoids going through the formatting machinery which seems to
/// substantially slow things down.
///
/// The `itoa` crate does the same thing as this formatter, but is a bit
/// faster. We roll our own, which is a bit slower, but it gets us enough of
/// a win to be satisfied with, and in pure safe code.
#[derive(Debug)]
pub(crate) struct DecimalFormatter {
buf: [u8; Self::MAX_U64_LEN],
start: usize,
}
impl DecimalFormatter {
/// Discovered via `u64::MAX.to_string().len()`.
const MAX_U64_LEN: usize = 20;
/// Create a new decimal formatter for the given 64-bit unsigned integer.
pub(crate) fn new(mut n: u64) -> DecimalFormatter {
let mut buf = [0; Self::MAX_U64_LEN];
let mut i = buf.len();
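// Write digits from least significant to most significant, filling
// the buffer from the end so the result reads left to right.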
loop {
i -= 1;
let digit = u8::try_from(n % 10).unwrap();
n /= 10;
buf[i] = b'0' + digit;
if n == 0 {
break;
}
}
DecimalFormatter { buf, start: i }
}
/// Return the decimal formatted as an ASCII byte string.
pub(crate) fn as_bytes(&self) -> &[u8] {
&self.buf[self.start..]
}
}
/// Trim prefix ASCII spaces from the given slice and return the corresponding
/// range.
///
/// This stops trimming a prefix as soon as it sees non-whitespace or a line
/// terminator.
pub(crate) fn trim_ascii_prefix(
line_term: LineTerminator,
slice: &[u8],
range: Match,
) -> Match {
fn is_space(b: u8) -> bool {
match b {
b'\t' | b'\n' | b'\x0B' | b'\x0C' | b'\r' | b' ' => true,
_ => false,
}
}
let count = slice[range]
.iter()
.take_while(|&&b| -> bool {
is_space(b) && !line_term.as_bytes().contains(&b)
})
.count();
range.with_start(range.start() + count)
}
pub(crate) fn find_iter_at_in_context<M, F>(
searcher: &Searcher,
matcher: M,
mut bytes: &[u8],
range: std::ops::Range<usize>,
mut matched: F,
) -> io::Result<()>
where
M: Matcher,
F: FnMut(Match) -> bool,
{
// This strange dance is to account for the possibility of look-ahead in
// the regex. The problem here is that mat.bytes() doesn't include the
// lines beyond the match boundaries in multi-line mode, which means that
// when we try to rediscover the full set of matches here, the regex may no
// longer match if it required some look-ahead beyond the matching lines.
//
// PCRE2 (and the grep-matcher interfaces) has no way of specifying an end
// bound of the search. So we kludge it and let the regex engine search the
// rest of the buffer... But to avoid things getting too crazy, we cap the
// buffer.
//
// If it weren't for multi-line mode, then none of this would be needed.
// Alternatively, if we refactored the grep interfaces to pass along the
// full set of matches (if available) from the searcher, then that might
// also help here. But that winds up paying an upfront unavoidable cost for
// the case where matches don't need to be counted. So then you'd have to
// introduce a way to pass along matches conditionally, only when needed.
// Yikes.
//
// Maybe the bigger picture thing here is that the searcher should be
// responsible for finding matches when necessary, and the printer
// shouldn't be involved in this business in the first place. Sigh. Live
// and learn. Abstraction boundaries are hard.
let is_multi_line = searcher.multi_line_with_matcher(&matcher);
if is_multi_line {
if bytes[range.end..].len() >= MAX_LOOK_AHEAD {
bytes = &bytes[..range.end + MAX_LOOK_AHEAD];
}
} else {
// When searching a single line, we should remove the line terminator.
// Otherwise, it's possible for the regex (via look-around) to observe
// the line terminator and not match because of it.
let mut m = Match::new(0, range.end);
trim_line_terminator(searcher, bytes, &mut m);
bytes = &bytes[..m.end()];
}
matcher
.find_iter_at(bytes, range.start, |m| {
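// Stop reporting matches once they start beyond the caller's
// range; the extra bytes exist only for look-ahead to observe.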
if m.start() >= range.end {
return false;
}
matched(m)
})
.map_err(io::Error::error_message)
}
/// Given a buf and some bounds, if there is a line terminator at the end of
/// the given bounds in buf, then the bounds are trimmed to remove the line
/// terminator.
pub(crate) fn trim_line_terminator(
searcher: &Searcher,
buf: &[u8],
line: &mut Match,
) {
let lineterm = searcher.line_terminator();
if lineterm.is_suffix(&buf[*line]) {
let mut end = line.end() - 1;
if lineterm.is_crlf() && end > 0 && buf.get(end - 1) == Some(&b'\r') {
end -= 1;
}
*line = line.with_end(end);
}
}
/// Like `Matcher::replace_with_captures_at`, but accepts an end bound.
///
/// See also: `find_iter_at_in_context` for why we need this.
fn replace_with_captures_in_context<M, F>(
matcher: M,
bytes: &[u8],
range: std::ops::Range<usize>,
caps: &mut M::Captures,
dst: &mut Vec<u8>,
mut append: F,
) -> Result<(), M::Error>
where
M: Matcher,
F: FnMut(&M::Captures, &mut Vec<u8>) -> bool,
{
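// `last_match` tracks the end of the previous match so that the text
// between matches is copied through to `dst` unchanged.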
let mut last_match = range.start;
matcher.captures_iter_at(bytes, range.start, caps, |caps| {
let m = caps.get(0).unwrap();
if m.start() >= range.end {
return false;
}
dst.extend(&bytes[last_match..m.start()]);
last_match = m.end();
append(caps, dst)
})?;
let end = std::cmp::min(bytes.len(), range.end);
dst.extend(&bytes[last_match..end]);
Ok(())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn custom_decimal_format() {
let fmt = |n: u64| {
let bytes = DecimalFormatter::new(n).as_bytes().to_vec();
String::from_utf8(bytes).unwrap()
};
let std = |n: u64| n.to_string();
let ints = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 100, 123, u64::MAX];
for n in ints {
assert_eq!(std(n), fmt(n));
}
}
}