deps: drop bytecount in favor of memchr_iter(..).count()

As of the memchr 2.6 release, its Iterator::count method is specialized
to only count the number of occurrences instead of finding the offset of
each occurrence. This replaces ripgrep's use of the bytecount crate.
While micro-benchmarks suggested that memchr's method has better
throughput than bytecount, that advantage did not carry over to an
end-to-end ripgrep run. Namely, on a ~13GB haystack prior to this
change:

    $ time rg-bytecount 'You killed my friend, my best friend, my lifelong friend!' OpenSubtitles2018.raw.en --line-number
    441450441:- You killed my friend, my best friend, my lifelong friend!

    real    1.473
    user    1.186
    sys     0.286
    maxmem  12512 MB
    faults  0

And then after:

    $ time rg 'You killed my friend, my best friend, my lifelong friend!' OpenSubtitles2018.raw.en --line-number
    441450441:- You killed my friend, my best friend, my lifelong friend!

    real    1.532
    user    1.280
    sys     0.250
    maxmem  12512 MB
    faults  0

Perf is slightly worse by wall clock (1.532 versus 1.473 real), but
still in the same ballpark. That's good enough for me at the moment to
drop the extra dependency.

I did this because the marginal cost of adding the Iterator::count()
specialization to memchr was extremely small.
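
For reference, a minimal std-only sketch of what the counting call
computes. It assumes only that the result is the number of bytes equal
to the line terminator. `count_naive` and `line_number_at` are
hypothetical names used for illustration, not ripgrep or memchr APIs,
and the naive loop has none of memchr's vectorization:

```rust
/// Std-only stand-in for `memchr::memchr_iter(line_term, bytes).count()`.
/// It returns the same number but scans one byte at a time.
fn count_naive(bytes: &[u8], line_term: u8) -> u64 {
    bytes.iter().filter(|&&b| b == line_term).count() as u64
}

/// Hypothetical helper: the 1-based line number of the byte at `offset`,
/// derived by counting the line terminators that precede it.
fn line_number_at(haystack: &[u8], offset: usize, line_term: u8) -> u64 {
    1 + count_naive(&haystack[..offset], line_term)
}
```

Under these assumptions, reporting a line number for a match only
requires counting the terminators before the match offset, which is
why the throughput of the counting routine shows up in runs with
--line-number.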
Andrew Gallant
2023-09-02 12:25:34 -04:00
parent 551ad3bada
commit 6cdb99ea61
3 changed files with 4 additions and 12 deletions

@@ -3,7 +3,6 @@ A collection of routines for performing operations on lines.
 */
 use bstr::ByteSlice;
-use bytecount;
 use grep_matcher::{LineTerminator, Match};
 /// An iterator over lines in a particular slice of bytes.
@@ -110,7 +109,7 @@ impl LineStep {
 /// Count the number of occurrences of `line_term` in `bytes`.
 pub fn count(bytes: &[u8], line_term: u8) -> u64 {
-    bytecount::count(bytes, line_term) as u64
+    memchr::memchr_iter(line_term, bytes).count() as u64
 }
 /// Given a line that possibly ends with a terminator, return that line without