Does anyone here know about a Linux command line utility that can full text index (no stop words) a text file and retrieve lines from it using an ‘exact match’ substring?
Is that what you are looking for?
I haven’t tried it, but maybe https://github.com/purestorage/4grep works? How large are your files? I assume they are so large that grep is too slow?
I use sed and grep now. For searching 10 GB text files repeatedly, these are rather slow.
Thanks, this looks good. I will give it a try tomorrow.
ripgrep was quite a bit faster than grep for me, but I think my files are split into ~200 MB pieces.
It is all in one file now. Most of these tools speed things up by finding the relevant files faster, but in the end they still just grep.
It was worth a try; the docs looked good and promised a nice speedup. But in my application, grep is still better at it. I just have to be patient, or find a smart solution. I have one, but it will take a lot of space; space I do not have (yet).
You wrote “full text index”, but grep and sed can’t do that. “Full text search” generally means being able to search for any words in any order. SQLite FTS5 or Lucene can do full text search, though it may be overkill for you. These are libraries, and you need to write code to use them. In short, they build an inverted index of your text, and you can then run full text search queries against it.
If you only need to grep, then how about splitting your big files into pieces and running multiple greps in parallel? You can do it using xargs. For example, to grep many *.txt files in parallel, using 8 concurrent processes:
$ find . -name '*.txt' -print0 | xargs -0 -P 8 -n 1 grep 'your text to search'
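If everything sits in one big file, the splitting step could look roughly like the sketch below. It is untested on anything large: the scratch directory, the `chunk.` prefix, the tiny demo input and the chunk size of 2 lines are all made up for illustration, so pick a much larger line count for a real 10 GB file. Splitting with `-l` keeps lines whole, so a match never straddles two chunks. Keep in mind that the chunks need as much disk space as the original file, and that parallel greps only help if the disk is not already the bottleneck.

```shell
#!/bin/sh
# Sketch: split one file into line-aligned chunks, then grep the chunks in parallel.
# All paths and names below are hypothetical stand-ins.

workdir=$(mktemp -d)            # scratch directory for the demo
bigfile="$workdir/big.txt"      # stand-in for the real 10 GB file

# Tiny demo input so the sketch runs as-is.
printf '%s\n' 'alpha one' 'beta two' 'alpha three' 'gamma four' > "$bigfile"

# Cut the file into chunks of 2 lines each (use a far larger count for real data).
split -l 2 "$bigfile" "$workdir/chunk."

# Run up to 8 greps concurrently, one chunk per grep.
# -F treats the pattern as a fixed string (exact substring), -h omits file names.
# Output order across chunks is not guaranteed, hence the sort.
matches=$(find "$workdir" -name 'chunk.*' -print0 |
  xargs -0 -P 8 -n 1 grep -F -h 'alpha' | sort)

printf '%s\n' "$matches"

# Remove the scratch directory again.
rm -rf "$workdir"
```

If you need the original line numbers, this simple version loses them, because each grep only sees its own chunk.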
That would be an option, except that my computer has all of 8 processors and 32 GB at most …
But thanks for the hint.