Recoll - a resource hog?

I have a lot of PDF files on my server that are product guides and documentation. So many, in fact, that I often want to search for a keyword but don't know which manual contains it. Adobe's Acrobat Reader can do this, but it's not supported on Linux (and it's not always a "friendly" program). While researching Linux alternatives I came across what seemed like a usable tool with a UI in addition to lots of features: recoll.

So I installed it (apt install recoll) and, after adding the needed support files, launched it. The first thing it needs to do is index the files. I have (at this writing) just under 500,000 files of all sorts (system, libraries, binaries, etc.) on my server. I let it go without restricting it to PDF files (who knows, if this works as advertised I may want to search scripts, document files, etc.).

I minimized the program and let it run. After a while I did a search and it produced usable results.

But the impact on my system was deeply felt. My entire system became terribly sluggish, even though 'top', 'ps', and other utilities (like Stacer, bashtop, etc.) reported nothing untoward. I spent the last three days conducting a training session, and it was almost painful to wait for screen updates and such. I finally stopped all recoll indexing processes, removed the program, and deleted my .recoll folder.

My system is back to normal, and I've found I can just use the command-line pdfgrep to perform my searches.

This isn't a call for help, but a posting of my experience. If you use recoll and it doesn't drag your system into molasses-mode, I'd be curious how you did it.

TBH that just sounds typical of a lot of software: well-intentioned, but implemented too poorly to actually be usable in any real-world environment. (See also Windows Update, Windows Antivirus, etc.)

I know you said this wasn't a call for help, but if you find yourself forced to reinstall it because otherwise finding anything in your sea of PDFs is impossible:

  1. Install htop so you can see why it's destroying your machine. (Hint: It's IO, and possibly memory exhaustion).
  2. Don't let it grind through 500,000 files for no reason, eesh! :slight_smile:
  3. Find the recoll binary and rename it to "recoll.whatever". Create a script named "recoll" in its place instead, with a body of nice -n 20 ionice -c 3 recoll.whatever (see the sketch below).

Alternatively, if it's ONLY going to run when you invoke it, put the script in ~/bin instead, and edit the menu item to point to the script. (This is the better option by a large margin if viable).
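For what it's worth, here is roughly what that wrapper could look like. This is only a sketch: it assumes the real binary was renamed to recoll.whatever somewhere on your PATH, and that ~/bin is on your PATH too if you go that route.

  #!/bin/sh
  # Start recoll with low CPU priority and idle-class IO priority,
  # passing any arguments straight through to the real binary.
  exec nice -n 20 ionice -c 3 recoll.whatever "$@"

Remember to make the script executable (chmod +x).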

Hi, @arQon.

I have two questions / remarks regarding your nice and informative reply to @OldStrummer :slight_smile: Specifically, when you wrote the following:

... I have the following remarks:

1 - I believe that you meant to write "ionice" in the command above, instead of "iotop". Manpage of the "ionice" command: https://linux.die.net/man/1/ionice ; Manpage of "iotop": https://linux.die.net/man/1/iotop

2 - If the intention is to be the least favorable to the "recoll" process, then I believe you should use "nice -n 19" (nineteen) instead of "nice -n -20" (negative twenty). From the manpage of "nice" - https://linux.die.net/man/1/nice :

"(...) Run COMMAND with an adjusted niceness, which affects process scheduling. With no COMMAND, print the current niceness. Nicenesses range from -20 (most favorable scheduling) to 19 (least favorable).

-n, --adjustment=N
add integer N to the niceness (default 10)
(...)"

So, I believe that the command should actually be the following:

nice -n 19 ionice -c idle recoll.whatever

Do you agree with me? :slight_smile:


The recollindex process already nices and ionices itself down to the bottom:

jftp$ ps -eo pid,ni,args | grep recollindex | grep -v grep
 272900  19 recollindex -c /home/...
jftp$ ionice -p 272900
idle

You can check the code (src/index/recollindex.cpp, master branch of the recoll repository on GitLab) if you are so inclined.

So your script suggestion accomplishes exactly nothing. In addition, the program sets itself up as a good candidate for the OOM killer, just in case.
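(For reference, the usual Linux mechanism for volunteering a process to the OOM killer is the oom_score_adj interface. The line below is a paraphrase rather than the actual recoll code, and the value is only an illustrative positive number; see the link above for the real thing.)

  # Raise this process's OOM score so it gets killed first under memory pressure;
  # children inherit the value.
  echo 500 > /proc/self/oom_score_adj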

So, something else is slowing down the system, and I agree with your suggestion that the issue may be memory exhaustion. The number of files is probably not the issue here; more probably, some documents are causing problems (possibly not in the recollindex process itself but in one of the helper programs it uses for extracting data).

I would suggest forcing single-threaded operation by adding the following to ~/.recoll/recoll.conf:

thrQSizes = -1 -1 -1

Then, if the problem persists, have a look at the log (set it at level 3 or 4) to see what the program is doing when it becomes obnoxious, and/or use top to see who is working hard.
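For example, the relevant part of the file could look like this (a sketch only; the parameter names are from memory, so check the manual if they do not take effect):

  # ~/.recoll/recoll.conf
  # Force single-threaded indexing:
  thrQSizes = -1 -1 -1
  # Verbose log, written to a file instead of stderr:
  loglevel = 4
  logfilename = /tmp/recollindex.log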

There are problem files on many systems, and they are usually not among the data that you actually want to index, which is a reason to be a bit more selective about what you index.

For example, if you are sure that you only want to index free-standing PDFs, this can be done with the following in the same configuration file:

indexedmimetypes = application/pdf
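If you also want to restrict which directory trees get scanned in the first place, topdirs in the same file does that (the path below is only a placeholder):

  # Only scan this tree, and only index PDFs in it:
  topdirs = ~/docs/manuals
  indexedmimetypes = application/pdf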

As usual, the manual is your friend: see "Parameters affecting what documents we index" in the Recoll user manual.


Doh, and doh! That's what I get for posting so late at night. Thanks for cleaning up the mess! :slight_smile:

(For reference, 20 is still a valid nice value; it just turns into 19 when used. I use 20 out of old habit, though I no longer remember how that habit started.)

On a practical level, plain nice is probably the better choice now: nobody bothers tuning machines any more, so anything over 0 is effectively 19 these days.
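A quick way to see the clamping, assuming GNU coreutils and a starting niceness of 0:

  nice -n 20 sleep 30 &
  ps -o ni= -p $!      # prints 19: the requested adjustment is clamped to the maximum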


Well, that was the point of #1: so we could check whether it turned out not to help. I'm not going to install and run every random piece of code in the world just so I know exactly how it behaves, when damage like that described by the OP only ever has one of two causes.

The number of files is probably not the issue here

If you don't think doing 500,000 gratuitous seeks, 500,000 gratuitous opens, and 500,000 gratuitous reads just to discover that a file isn't a PDF is a waste of energy, I have some bad news for you. :slight_smile:
I agree it's probably not the root of his problem, but we don't know his hardware. Given how bad Linux's exhaustion handling is (especially with regard to nodes), and more importantly how often IO ends up in an expensive and uninterruptible state, it has no possible upside and a potentially massive downside, which is pretty much the definition of a Very Bad Idea.

Anyway, assuming it's a simple swap thrash problem at heart, he's got two likely problems: "bad" PDFs, as you say, which will destroy all the cache in the machine before failing (and then potentially get stuck on the file indefinitely, though it seems unlikely anything more than a year old would have that sort of bug in it still); and "good" PDFs that are simply large enough to destroy all the cache in the machine and then start thrashing the PDF and the indexes in and out of disk a few million times before eventually finishing and moving on, and then doing the same thing again on the next file.

The OOM killer isn't really relevant in either scenario, since by the time it does anything it's already far too late to be useful - though kudos to recoll's author for taking it into consideration so it at least doesn't cause Firefox/Chrome to be killed instead.

You're a bit late with this kind of news; I got it around 1986 when I first met Unix V7 :slight_smile:

What I meant is that the total number of files is not very relevant for current machine performance. What counts mostly is how many you process in parallel, and the total amount of resources needed at any given point. Then, if the index becomes really big, the updates may use more resources, because the trees are deeper, more things need to be cached, etc. Also, the total number may be relevant if the machine is not being used interactively at the same time: a lot of the interactive programs and data will end up being evicted, and the system will feel sluggish for a while when normal usage resumes (e.g. in the morning). This is not what was described here.

In the end, you do have to read the files to index them, but the problem reported here is relatively unusual. On my laptop (700k files, 16 GB, SSD), running a full index reset is not noticeable at all from the desktop (doing it right now to make sure).

Knowing more about the hardware, especially whether we're dealing with an SSD or a spinning disk and how much RAM is available, would of course be essential here. Recollindex rarely causes issues on normal hardware, but maybe it needs to be throttled (e.g. by limiting the thread count) on this particular machine. By default the number of threads is based purely on the number of processors, so it may be too high if the machine is memory or I/O starved.

Then there is also the possibility that the hardware has an actual problem (e.g. disk read retries for some of the data, possibly old stuff not used interactively).

I've never seen a PDF cause this kind of issue with pdftotext. The indexer also sets resource limits on any single-file data extraction, but maybe they are set too high for this specific machine. For the record (as this user has given up), the first thing to do would be to limit the parallelism of the indexer through the configuration, and also make sure that we only process PDFs.

Yes, I think that this is the general idea...

What I meant is that the total number of files is not very relevant for current machine performance.

You're basing that off your own imaginary idea of "current machine performance", where what you mean is "on my machine, with my SSD, and probably with warm caches".

I have an HDD for bulk storage in this machine, and I needed a file and size count from one userdata tree a few days ago as a reference for something; it was a painful reminder of how bad random access is on one. It did finally finish counting ~450K files (280GB), but it took about ten minutes. That's on a WD Black, where the reads are only for the directory entries, not for investigating each individual file itself. I'm not interested in burning all the time and energy that would take, but it's easily going to add at least an hour, and probably several times that. Factor in the performance of a 5400RPM 2.5" laptop drive, and a much slower laptop CPU with only two cores, and you're at a minimum of 2 hours as a very generous estimate, throughout which the machine will be at best unpleasant to use and very possibly unusable.
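(For reference, the count itself was nothing fancier than something along these lines, with the path being just a placeholder:

  find /data/userfiles -type f | wc -l      # file count
  du -sh /data/userfiles                    # total size

so only metadata was being read, never file contents.)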

So, no "not very relevant" is wildly far from the truth.
Hell, I've got quad-core Atoms where simply driving the wifi at full speed to rsync to a NAS slows the mouse cursor update rate to 1-second jumps, and makes it literally impossible to double-click something, because by the time it registers the second click it's already well past the timeout.

I get the feeling you're being overly defensive because you think I'm saying recoll itself is to blame here. I'm not: we already passed that point, and will only end up back there if it turns out it's busy thrashing everything because it's e.g. naively trying to index multiple files at the same time with small reads, ping-ponging the heads, etc.

I did consider that, but it's not really how HDDs fail any more. The chances that the drive is "fine, except for a handful of sectors that the PDFs happen to be on", even for a collection as large as the one he's talking about, are so tiny that it's just not credible. For any drive in the post-100MB era (and yes, I do mean MB!), impending death is nearly always readily visible by slowdowns across the board.

The indexer also sets resource limits on any single file data extraction, but maybe it's set too high for this specific machine. For the record (as this user has given up) the first thing to do would be to limit the parallelism of the indexer through the configuration. And also, make sure that we only process pdfs.

You've got that at least partly backwards: PDFs only first, unquestionably. As for the other two, there are too many unknowns to say right now. A bad rlimit would be the cause of the thrash, but uninterruptible IO would cause the same terrible user impact; without htop info to work with, I think it's pretty much a coin flip which order to try those in.

"current" did not allude to the machine age, but to what happens while the operation is going on. I do have a bit of experience running stuff on spinning disks (did you miss the V7 part?).

find / -ls might take ten days, and that's still not the question; the question is what effect it has on interactive use while it is running.

About possible disk issues: PDF storage is not the point here, since the user is indexing the whole storage area, so a bad spot is not out of the question.

Anyway, we'll never know because determining the cause would need some experimentation.

Nothing special; I tried several "personal indexers" through the years and stuck with Recoll because it has been by far the most efficient.

The only performance problem I can remember has been with btrfs autodefrag on the default Ubuntu 20.04 kernel (5.15?): index updates were causing huge I/O amplification. I'm using the (xanmod) 5.19 kernel now, and the problem is gone.
To any btrfs user who doesn't want to upgrade kernels, I'd recommend turning off copy-on-write (chattr +C) on the xapiandb directory before creating the index. Oh, and if you're snapshotting a parent directory, create xapiandb as a subvolume instead of a subdirectory so the Xapian files don't get snapshotted, because snapshots will force copy-on-write during subsequent updates.
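Something like this before the first indexing run, assuming the default ~/.recoll location:

  mkdir -p ~/.recoll/xapiandb
  chattr +C ~/.recoll/xapiandb      # new files created inside inherit No_COW
  # or, if the parent directory gets snapshotted, make the index its own subvolume instead:
  # btrfs subvolume create ~/.recoll/xapiandb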


English is unfortunate like that sometimes. :slight_smile:

I do have a bit of experience running stuff on spinning disks (did you miss the V7 part?).

Nope, but given the context...

find / -ls might take ten days, and it's still not the question, which is to know what effect it has on current interactive use.

Right - which is why I referred to the example of a wifi driver that can make a machine completely unresponsive despite only moving 50Mb/s of data. That's less than a tenth of the bandwidth of even a laptop-grade HDD, and all as sequential activity that isn't literally thrashing the physical heads of a drive doing scattershot reads across the entirety of the medium.
Your point is absolutely valid, but uninterruptible IO is exactly what does make interactive use impossible, especially if that IO is also being used to cope with memory exhaustion via swap.

I don't know if you've heard of le9 or not, but it's one of the many attempts to improve Linux's abysmal swap handling. It's focused almost solely on not evicting dentries etc., and outperforms the current mess by absolutely staggering amounts from that alone.

About possible disk issues: pdf storage is not the question here, the user is indexing the whole storage area, so a bad spot is not out of the question.

Yes and no, but mostly "no". If you have a 4GB ISO on the disk, exactly one sector of it contains the file header that gets checked for the PDF magic. The rest of the file never gets read, so it doesn't matter if there is a bad block in there.
That's an extreme example, obviously, but you're still looking at a pretty tiny chance that the bad block in a file is that first block, since even just a 50K average file size puts you at 100-to-1 odds. So aside from "bad piece of disk" being highly unlikely in the first place, it wouldn't cause the sort of problem described here even if it were. Or vice versa, if you prefer.
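(To make the "first block" point concrete, a type sniff only needs the first few bytes; the path is just a placeholder:

  head -c 5 /path/to/whatever.iso | od -c    # a real PDF would start with "%PDF-"

and nothing past that first block is ever touched just to rule the file out.)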

Anyway, we'll never know because determining the cause would need some experimentation.

Yep. It's just a nice way to unwind. :slight_smile:

If the user does not tell it otherwise, recoll will happily index whatever it finds on the disk and knows how to process. For example, multigigabyte compressed web log files will be indexed if you do not tell it otherwise (guess how I know this...). Not an ISO, though (for now :slight_smile: ). Other interesting sources of problems are some PostScript files, Zip archives, monster Thunderbird folders, and many others... This is why, in this case, setting indexedmimetypes = application/pdf is important, because it brings us back to the process you describe above (at worst a small read per file, and most often not even that).

Hello! We've been using Recoll for several years now and have not had our system freeze, but lately we tried to "reindex" the folders (we only have certain ones selected, those that contain backups and current files).

Below is a copy of our Info screen:

Operating System: PCLinuxOS 2023
KDE Plasma Version: 5.27.5
KDE Frameworks Version: 5.106.0
Qt Version: 5.15.6
Kernel Version: 6.3.5-pclos1 (64-bit)
Graphics Platform: X11
Processors: 8 × AMD Ryzen 5 3400G with Radeon Vega Graphics
Memory: 5.7 GiB of RAM
Graphics Processor: AMD Radeon Vega 11 Graphics
Manufacturer: Gigabyte Technology Co., Ltd.
Product Name: B450 AORUS M
We read all of the above thread but are still unsure what we could do. We added the setting
thrQSizes = -1 -1 -1
and it seemed to help a little, but eventually the system got slower and slower until ... it froze.

If I should have started a new thread, please say so and, if possible, move this one.

Have a beautiful day and be happy! :slight_smile:


Welcome @frazelle09 to the community!
