mirror of
https://github.com/BurntSushi/ripgrep.git
synced 2025-08-05 14:42:07 -07:00
BREAKING: regex: finally remove CRLF hack
Now that Rust's regex crate finally supports a CRLF mode, we can remove this giant hack in ripgrep to enable it. (The hack assuredly did not work in all cases.)

The way this works in the regex engine is actually subtly different from what ripgrep previously did. Namely, --crlf would previously treat either \r\n or \n as a line terminator. But now it treats \r\n, \n and \r as line terminators. In effect, it is implemented by treating \r and \n as line terminators, but ^ and $ will never match at a position between a \r and a \n.

So basically this means that $ will end up matching in more cases than might be intended, but I don't expect this to be a big problem in practice.

Note that passing --crlf to ripgrep and enabling CRLF mode in the regex via the `R` inline flag (e.g., `(?R:$)`) are subtly different. The `R` flag just controls the regex engine, but --crlf instructs all of ripgrep to use \r\n as a line terminator. There are likely some inconsistencies or corner cases that are wrong as a result of this cognitive dissonance, but we choose to leave well enough alone for now.

Fixing this for real will probably require re-thinking how line terminators are handled in ripgrep. For example, one "problem" with how they're handled now is that ripgrep will re-insert its own line terminators when printing output instead of copying the input. This is maybe not so great and perhaps unexpected. (ripgrep probably can't get away with not inserting any line terminators. Users probably expect files that don't end with a line terminator whose last line matches to have a line terminator inserted.)
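
To make the line-anchor rule described in the message concrete, here is a minimal std-only Rust sketch. It is only a model of the rule as stated (both \r and \n terminate a line, but ^ and $ never match between a \r and a \n), not the regex crate's actual implementation. The function names `is_start_crlf` and `is_end_crlf` are hypothetical and chosen for illustration.

```rust
/// Model of `$` in multi-line CRLF mode: reports whether an end-of-line
/// anchor may match at byte offset `at`.
///
/// Assumption: `at <= haystack.len()`; larger offsets will panic.
pub fn is_end_crlf(haystack: &[u8], at: usize) -> bool {
    // End of input always counts as an end of line.
    at == haystack.len()
        // Immediately before a `\r` (whether or not a `\n` follows).
        || haystack[at] == b'\r'
        // Immediately before a `\n`, unless that would split a `\r\n` pair.
        || (haystack[at] == b'\n' && (at == 0 || haystack[at - 1] != b'\r'))
}

/// Model of `^` in multi-line CRLF mode: reports whether a start-of-line
/// anchor may match at byte offset `at`.
///
/// Assumption: `at <= haystack.len()`; larger offsets will panic.
pub fn is_start_crlf(haystack: &[u8], at: usize) -> bool {
    // Start of input always counts as a start of line.
    at == 0
        // Immediately after a `\n`.
        || haystack[at - 1] == b'\n'
        // Immediately after a `\r`, unless that would split a `\r\n` pair.
        || (haystack[at - 1] == b'\r'
            && (at == haystack.len() || haystack[at] != b'\n'))
}
```

For the haystack `a\r\nb\rc\nd`, this model lets `$` match at offset 1 (before the `\r` of the `\r\n` pair), at offset 4 (before the lone `\r`), at offset 6 (before the lone `\n`) and at the end of input, but not at offset 2, which sits between the `\r` and the `\n`. The lone `\r` case is the one the previous --crlf behavior did not treat as a line terminator.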
@@ -6,7 +6,7 @@ use {
 };

 use crate::{
-    ast::AstAnalysis, crlf::crlfify, error::Error, literal::LiteralSets,
+    ast::AstAnalysis, error::Error, literal::LiteralSets,
     multi::alternation_literals, non_matching::non_matching_bytes,
     strip::strip_from_match,
 };
@@ -75,6 +75,7 @@ impl Config {
             .case_insensitive(self.is_case_insensitive(&analysis))
             .multi_line(self.multi_line)
             .dot_matches_new_line(self.dot_matches_new_line)
+            .crlf(self.crlf)
             .swap_greed(self.swap_greed)
             .unicode(self.unicode)
             .build()
@@ -88,8 +89,7 @@ impl Config {
             original: pattern.to_string(),
             config: self.clone(),
             analysis,
-            // If CRLF mode is enabled, replace `$` with `(?:\r?$)`.
-            expr: if self.crlf { crlfify(expr) } else { expr },
+            expr,
         })
     }

@@ -167,19 +167,6 @@ impl ConfiguredHIR {
         non_matching_bytes(&self.expr)
     }

-    /// Returns true if and only if this regex needs to have its match offsets
-    /// tweaked because of CRLF support. Specifically, this occurs when the
-    /// CRLF hack is enabled and the regex is line anchored at the end. In
-    /// this case, matches that end with a `\r` have the `\r` stripped.
-    pub fn needs_crlf_stripped(&self) -> bool {
-        self.config.crlf
-            && self
-                .expr
-                .properties()
-                .look_set_suffix_any()
-                .contains(hir::Look::EndLF)
-    }
-
     /// Returns the line terminator configured on this expression.
     ///
     /// When we have beginning/end anchors (NOT line anchors), the fast line
@@ -298,6 +285,7 @@ impl ConfiguredHIR {
             .octal(self.config.octal)
             .multi_line(self.config.multi_line)
             .dot_matches_new_line(self.config.dot_matches_new_line)
+            .crlf(self.config.crlf)
             .unicode(self.config.unicode);
         let meta = Regex::config()
             .utf8_empty(false)
@@ -321,6 +309,7 @@ impl ConfiguredHIR {
             .utf8(false)
             .multi_line(self.config.multi_line)
             .dot_matches_new_line(self.config.dot_matches_new_line)
+            .crlf(self.config.crlf)
             .unicode(self.config.unicode)
             .build()
             .parse(pattern)