Filename index parsing: "warning: Invalid multibyte data detected."

ericmarceau · 20 December 2024 18:01

I've come across an indexing "iceberg".

I recently noticed that I was not getting a "full" indexing of files when performing some grep and awk operations.

I perform a find, then I do various parsing using grep and awk.

With the grep, I discovered that I had to modify the command from 'grep' to

grep -a

With awk, I am not sure what to do. The error I get is:

	 [/DB001_F4|244]  Extracting type jpg ...
awk: cmd. line:1: (FILENAME=/DB001_F4/0-DB001_F4-20241219213848.details/0-DB001_F4-20241219213848.files_jpg FNR=485) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.

	 [/DB001_F4|236]  Extracting type htm ...
awk: cmd. line:1: (FILENAME=/DB001_F4/0-DB001_F4-20241219213848.details/0-DB001_F4-20241219213848.files_htm FNR=511) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.

	 [/DB001_F4|217]  Extracting type txt ...
awk: cmd. line:1: (FILENAME=/DB001_F4/0-DB001_F4-20241219213848.details/0-DB001_F4-20241219213848.files_txt FNR=401) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
	 [/DB001_F4|216]  Extracting type doc ...
awk: cmd. line:1: (FILENAME=/DB001_F4/0-DB001_F4-20241219213848.details/0-DB001_F4-20241219213848.files_doc FNR=1047) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.

I am in Canada, and filenames with accents should be "normal" for my character set.

So ... what Locale can I specify that would be appropriate for me that would not cause such errors?

ugnvs · 20 December 2024 18:48

Does prefixing the awk command with LC_ALL=C solve your problem?

ericmarceau · 20 December 2024 20:17

Thank you, Eugene.

Does that mean that I have to prefix each instance of grep or awk with "LC_ALL=C" ?

ugnvs · 20 December 2024 20:40

Probably awk only.
Instead of prefixing each command, you can export the environment variable.
They say mawk can handle multibyte characters.

ericmarceau · 20 December 2024 20:46

So is LC_ALL an extra, above and beyond the other LC_* variables?
Or is it an override, replacing them?

Also, I prefixed the grep with that as well, instead of using 'grep -a', and that eliminated the issue of "binary input".

ugnvs · 20 December 2024 20:50

Placing LC_ALL=C before a command overrides 'global' LC* environment variables' values for that command only.
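To illustrate the difference between the per-command prefix and an export, here is a minimal sketch. The sample file and its contents are made up for the demonstration: one line carries a lone 0xE9 byte (Latin-1 "é"), which is not valid UTF-8 and is the kind of data that can trigger the awk warning under a UTF-8 locale.

```shell
# Hypothetical sample data: the first line contains byte 0xE9 (octal 351),
# which is Latin-1 "é" and is not a valid UTF-8 sequence on its own.
sample=$(mktemp)
printf 'caf\351.jpg\nplain.jpg\n' > "$sample"

# Prefix form: LC_ALL=C applies to this single awk invocation only,
# so awk treats the input as plain bytes.
count=$(LC_ALL=C awk '/\.jpg$/ { n++ } END { print n }' "$sample")
echo "lines ending in .jpg: $count"

# The calling shell's own LC_ALL is not changed by the prefix form.
echo "LC_ALL in this shell: ${LC_ALL:-<unset>}"

# Export form: applies to every command started afterwards. It is wrapped
# in a subshell here so the setting does not leak into the rest of the session.
( export LC_ALL=C; grep -c 'jpg' "$sample" )

rm -f "$sample"
```

In a script that only does byte-oriented matching, a single export near the top may be simpler than prefixing every call; the prefix form is safer when other commands in the same script still need the UTF-8 locale.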

ericmarceau · 20 December 2024 21:06

Thank you again, Eugene.

I just checked and there is no reference to any "LC_" in my .bash files. Also, when I do

set | grep LC_

or

env | grep LC_

no defined value is reported for any of them!

Is that normal ?

To clarify, my objective is to be able to successfully parse randomly-formed filenames, for a given string regexp, successfully, regardless of presence of smiley icons or other "wacky" or infrequently used characters.

I unfortunately allowed myself to save files in that way, because I was lazy at the time. I should have saved them with no inappropriate characters or spaces at the outset. But I have to deal with that mess now.

ugnvs · 20 December 2024 21:21

That is normal. There is the LANG variable instead. You can see the details with the locale command, and change the locale globally with sudo dpkg-reconfigure locales

My environment contains a number of LC related variables only due to custom settings in System > Preferences > Personal > Language support menu.
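As a rough sketch of why an empty "env | grep LC_" is not a problem: the character-type setting is looked up as LC_ALL first, then LC_CTYPE, then LANG. The helper name effective_ctype below is hypothetical and only mirrors that lookup order; it is not part of any tool.

```shell
# Show the current settings, if the locale command is available.
# Values printed in double quotes are implied (from LANG or LC_ALL)
# rather than set directly in the environment.
if command -v locale >/dev/null 2>&1; then
    locale
fi

# effective_ctype: hypothetical helper that mirrors the lookup order
# LC_ALL, then LC_CTYPE, then LANG, then the POSIX default.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'POSIX\n'
    fi
}

effective_ctype
```

With only LANG set, as in the situation described above, the helper prints the LANG value, which is what awk and grep would use for character handling.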

ugnvs · 20 December 2024 21:28

Do you know an old joke https://regex.info/blog/2006-09-15/247 ?
As I have mentioned, I like Python scripting. That is, I'd try to solve the problem using Python.
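For comparison, a shell-only sketch of the same goal, using the byte-oriented LC_ALL=C approach from earlier in the thread. Everything here is an assumption for illustration: list_names is a hypothetical stand-in for something like "find ... -print0", the three filenames are invented, and the -z option (NUL-separated records) is a GNU grep extension.

```shell
# list_names: hypothetical stand-in for `find <dir> -type f -print0`.
# It emits three NUL-terminated names: one valid UTF-8 name with accents,
# one with a lone Latin-1 byte (invalid as UTF-8), and one plain ASCII name.
list_names() {
    printf 'r\303\251sum\303\251 final.doc\0caf\351 menu.doc\0notes.txt\0'
}

# LC_ALL=C makes grep match on raw bytes, -a keeps it from treating the
# input as binary, and -z (GNU grep) reads NUL-separated records so that
# spaces or newlines inside names cannot split a record.
doc_count=$(list_names | LC_ALL=C grep -a -z -c '\.doc$')
echo "names ending in .doc: $doc_count"
```

The trade-off is that in the C locale a multibyte character counts as several bytes, so a pattern such as "." matches one byte rather than one accented letter. For patterns made of ASCII text, such as a file extension, that makes no difference.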