I recently noticed that I was not getting a "full" indexing of files when performing some grep and awk operations.
I perform a find, then I do various parsing using grep and awk.
With grep, I discovered that I had to change the command from 'grep' to
grep -a
With awk, I am not sure what to do. The error I get is:
[/DB001_F4|244] Extracting type jpg ...
awk: cmd. line:1: (FILENAME=/DB001_F4/0-DB001_F4-20241219213848.details/0-DB001_F4-20241219213848.files_jpg FNR=485) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
[/DB001_F4|236] Extracting type htm ...
awk: cmd. line:1: (FILENAME=/DB001_F4/0-DB001_F4-20241219213848.details/0-DB001_F4-20241219213848.files_htm FNR=511) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
[/DB001_F4|217] Extracting type txt ...
awk: cmd. line:1: (FILENAME=/DB001_F4/0-DB001_F4-20241219213848.details/0-DB001_F4-20241219213848.files_txt FNR=401) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
[/DB001_F4|216] Extracting type doc ...
awk: cmd. line:1: (FILENAME=/DB001_F4/0-DB001_F4-20241219213848.details/0-DB001_F4-20241219213848.files_doc FNR=1047) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
I am in Canada, and filenames with accents should be "normal" for my character set.
So ... what Locale can I specify that would be appropriate for me that would not cause such errors?
I just checked and there is no reference to any "LC_" in my .bash files. Also, when I do
set | grep LC_
or
env | grep LC_
no value is reported for any of them!
Is that normal?
To clarify, my objective is to reliably match a given regexp against arbitrarily formed filenames, regardless of the presence of smiley icons or other "wacky" or infrequently used characters.
I unfortunately allowed myself to save files in that way, because I was lazy at the time. I should have saved them with no inappropriate characters or spaces at the outset. But I have to deal with that mess now.
Do you know the old joke at https://regex.info/blog/2006-09-15/247 ?
As I have mentioned, I like Python scripting, so I'd try to solve the problem using Python.
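Here is a rough sketch of what I mean, not a drop-in replacement for your script. The idea is to treat every line of the file list as raw bytes and match a bytes regexp against it, so no locale or decoding is involved and nothing can be rejected as "invalid multibyte data". The function names and the sample paths are made up for illustration, and I am assuming one filename per line, separated by newlines.

```python
import re


def matching_lines(pattern, data):
    """Return the non-empty lines of `data` (bytes) that match the bytes regexp `pattern`."""
    rx = re.compile(pattern)  # a bytes pattern is matched byte by byte, no locale involved
    return [line for line in data.split(b"\n") if line and rx.search(line)]


def show(line):
    """Decode for display only; bytes that are not valid UTF-8 become U+FFFD."""
    return line.decode("utf-8", errors="replace")


# Hypothetical sample: one Latin-1 name, one UTF-8 name with a smiley, one plain ASCII name.
sample = (
    b"/docs/r\xe9sum\xe9.doc\n"
    b"/pics/caf\xc3\xa9 \xf0\x9f\x98\x80.jpg\n"
    b"/notes/plain.txt\n"
)

hits = matching_lines(rb"\.(jpg|doc)$", sample)
for line in hits:
    print(show(line))
```

For a real list you would read the file with open(path, "rb") and pass the result of read() as data. Because the original bytes are kept in hits, they can be written back out unchanged; only show() is lossy.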