Does anyone here know about a Linux command line utility that can full text index (no stop words) a text file and retrieve lines from it using an ‘exact match’ substring?
Is that what you are looking for?
I haven’t tried it, but maybe https://github.com/purestorage/4grep works? How large are your files? I assume they are so large that grep is too slow?
I use sed and grep now. For repeatedly searching 10 GB text files, they are rather slow.
Thanks, this looks good. I will give it a try tomorrow.
ripgrep was quite a bit faster than grep for me, but I think that is because my files are split into ~200 MB pieces.
It is all in one file now. Most tools just try to find the applicable files faster, but then still effectively grep through them.
It was worth a try; the docs looked good and promised a nice speedup. But in my application, grep is still better at it. I just have to be patient, or find a smarter solution. I have one, but it would take a lot of space; space I currently do not have (yet).
You wrote “full text index”, but grep and sed can’t do that. By “full text search” what is generally meant is being able to search for any words in any order. SQLite FTS5 or Lucene can do full text search, though they may be overkill for you. These are libraries, so you need to write code to use them. In short, they index your text (an inverted index) and you can then run full-text search queries against them.
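That said, you can try the SQLite route without writing code, using the sqlite3 command-line shell. A minimal sketch, assuming a file named big.txt whose lines contain no ‘|’ characters (the shell’s default import separator), and SQLite 3.34+ for the trigram tokenizer, which enables substring-style matching on queries of 3+ characters:

```shell
# Build a one-column FTS5 index over the lines of big.txt.
# tokenize='trigram' allows substring queries (not just whole words).
$ sqlite3 corpus.db "CREATE VIRTUAL TABLE lines USING fts5(line, tokenize='trigram');"
$ sqlite3 corpus.db ".import big.txt lines"

# Retrieve every line containing the quoted substring:
$ sqlite3 corpus.db "SELECT line FROM lines WHERE lines MATCH '\"some exact text\"';"
```

Building the index costs time and disk space up front, but repeated queries afterwards should be much faster than re-scanning 10 GB with grep each time.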
If you only need grep, then how about splitting your big file into pieces and running multiple grep processes in parallel? You can do that using xargs. For example, to grep many *.txt files in parallel, using 8 concurrent processes:
$ find . -name '*.txt' | xargs -P 8 -n 1 grep 'your text to search'
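Since your data is a single file, you would first split it into pieces. A sketch, assuming GNU split (its -C flag fills each chunk with complete lines only, so no line is cut in half) and grep’s -h flag so hits are not prefixed with the chunk filename:

```shell
# Split the single big file into ~1 GB line-aligned chunks:
$ split -C 1G big.txt chunk_

# Grep all chunks with 8 concurrent processes:
$ ls chunk_* | xargs -P 8 -n 1 grep -h 'your text to search'
```

Note that with -P the output order across chunks is not deterministic; pipe through sort if that matters to you.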
That would be an option. Except that my computer has only 8 processors and 32 GB of RAM at max …
But thanks for the hint.
Please try the fastest full-text searcher written in C, open-source and public domain:
I am interested in how a native gcc compile behaves; please share some speed stats with that 10 GB file…
The gcc compile works, though it gives a warning. But then… how do I use it? What are the parameters to use on the command line?
It takes two parameters on the command line: the first is the filename of the file being searched, the second is the filename of the file containing the needle:
As you can see, the main() function is as simple as possible. It shows how to exhaust all the matches in, e.g., that 10 GB file by advancing the search position by 1 after each match found; you can change that to the length of the needle in order to avoid overlapping matches.
Once you have the offset (the match position within the haystack), you can either dump the surrounding bytes left and right to form a context, or dump the line (if one exists) containing the match.
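Going from an offset to context works even from the shell. A sketch, where OFF is a placeholder for a byte offset you obtained from the searcher, dumping 40 bytes either side of the match in big.txt (bash arithmetic; big.txt and the context width are assumptions):

```shell
# OFF stands in for a match offset reported by the search tool.
$ OFF=123456

# Dump 40 bytes before and after the match position (clamped at file start):
$ dd if=big.txt bs=1 skip=$(( OFF > 40 ? OFF - 40 : 0 )) count=80 2>/dev/null
```

bs=1 is slow in general, but for a single short context window it is fine.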
Sorry, but I do not read code at all. I am not a C programmer.
I just need a tool that finds the text in a corpus and returns the lines having one or more occurrences of the searched text.
This might be a nice base program, but it does not do that, as far as I understand.
No need to be sorry; the thing that is kind of misleading is the “Development” tag at the top of this page.
I want to help all people using console searchers by providing simply the fastest search techniques I know of.
As far as I can see, you need a tool. I wrote one 6+ years ago: Kazahana, an open-source, public-domain searcher using 16 threads or 1 thread, depending on what your resources are:
I think the thread name says it: looking up the best available resources.
The URL was refused. Searching for the name mostly turns up restaurants.