Does anyone here know about a Linux command line utility that can full text index (no stop words) a text file and retrieve lines from it using an ‘exact match’ substring?
Is that what you are looking for?
I haven’t tried it, but maybe https://github.com/purestorage/4grep works? How large are your files? I assume they are so large that grep is too slow?
I use sed and grep now. For repeatedly searching 10 GB text files, they are rather slow.
Thanks, this looks good. I will give it a try tomorrow.
ripgrep was quite a bit faster than grep for me, but I think that is because my files are split into ~200 MB pieces.
It is all in one file now. Most tools just try to find the applicable files faster, but then still effectively grep through them.
It was worth a try; the docs looked good and promised a nice speedup. But in my application, grep is still better at it. I just have to be patient, or find a smarter solution. I have one, but it would take a lot of space; space I currently do not have (yet).
You wrote “full text index”, but grep and sed can’t do that. By “full text search” what is generally meant is being able to search for any words in any order. SQLite FTS5 or Lucene can do full text search, though they may be overkill for you. These are libraries, so you need to write code to use them. In short, they index your text (an inverted index) and you can then run full-text search queries against them.
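That said, you can try the SQLite route without writing code, using the sqlite3 command-line shell. A minimal sketch, assuming a file named big.txt whose lines contain no ‘|’ characters (the shell’s default import separator), and SQLite 3.34+ for the trigram tokenizer, which enables substring-style matching on queries of 3+ characters:

```shell
# Build a one-column FTS5 index over the lines of big.txt.
# tokenize='trigram' allows substring queries (not just whole words).
$ sqlite3 corpus.db "CREATE VIRTUAL TABLE lines USING fts5(line, tokenize='trigram');"
$ sqlite3 corpus.db ".import big.txt lines"

# Retrieve every line containing the quoted substring:
$ sqlite3 corpus.db "SELECT line FROM lines WHERE lines MATCH '\"some exact text\"';"
```

Building the index costs time and disk space up front, but repeated queries afterwards should be much faster than re-scanning 10 GB with grep each time.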
If you only need grep, then how about splitting your big file into pieces and running multiple grep processes in parallel? You can do that using xargs. For example, to grep many *.txt files in parallel, using 8 concurrent processes:
$ find . -name '*.txt' | xargs -P 8 -n 1 grep 'your text to search'
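Since your data is a single file, you would first split it into pieces. A sketch, assuming GNU split (its -C flag fills each chunk with complete lines only, so no line is cut in half) and grep’s -h flag so hits are not prefixed with the chunk filename:

```shell
# Split the single big file into ~1 GB line-aligned chunks:
$ split -C 1G big.txt chunk_

# Grep all chunks with 8 concurrent processes:
$ ls chunk_* | xargs -P 8 -n 1 grep -h 'your text to search'
```

Note that with -P the output order across chunks is not deterministic; pipe through sort if that matters to you.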
That would be an option. Except that my computer has only 8 processors and 32 GB of RAM at max …
But thanks for the hint.
Please try the fastest full-text searcher written in C, open-source and public domain:
I am interested in how a native gcc compile behaves; please share some speed stats with that 10 GB file…
The gcc compile works, though it gives a warning. But then… how do I use it? What are the parameters to use on the command line?
It takes two parameters on the command line: the first is the filename of the file being searched, the second is the filename of the file containing the needle:
As you can see, the main() function is as simple as possible. It shows how to exhaust all the matches in, e.g., that 10 GB file by advancing the search position by 1 after each match found; you can change that to the length of the needle in order to avoid overlapping matches.
Once you have the offset (the match position within the haystack), you can either dump the surrounding bytes left and right to form a context, or dump the line (if one exists) containing the match.
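Going from an offset to context works even from the shell. A sketch, where OFF is a placeholder for a byte offset you obtained from the searcher, dumping 40 bytes either side of the match in big.txt (bash arithmetic; big.txt and the context width are assumptions):

```shell
# OFF stands in for a match offset reported by the search tool.
$ OFF=123456

# Dump 40 bytes before and after the match position (clamped at file start):
$ dd if=big.txt bs=1 skip=$(( OFF > 40 ? OFF - 40 : 0 )) count=80 2>/dev/null
```

bs=1 is slow in general, but for a single short context window it is fine.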
Sorry, but I do not read code at all. I am not a C programmer.
I just need a tool that finds the text in a corpus and returns the lines having one or more occurrences of the searched text.
This might be a nice base program, but it does not do that, as far as I understand.
No need to be sorry; the thing that is kind of misleading is the “Development” tag at the top of this page.
I want to help all people using console searchers by providing simply the fastest search techniques I know of.
As far as I can see, you need a tool. I wrote one 6+ years ago: Kazahana, an open-source, public-domain searcher using 16 threads or 1 thread, depending on what your resources are:
I think the thread name says it: looking up the best available resources.
The URL was refused. Searching for the name mostly turns up restaurants.