Find Non Ascii Characters In Text File Notepad App

May 21, 2018. Notepad uses ANSI (ASCII plus an extended 8-bit character set) as its default encoding for saving text files. If the file contains non-ANSI characters, Notepad shows a warning; if you accidentally dismiss it and save the file with the ANSI encoding anyway, all non-ANSI characters become unreadable.


I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:

But this returns every line in the file, regardless of whether the line contains a character in the range specified.

Do I have the syntax wrong or am I doing something else wrong? I've also tried:

(with both single and double quotes surrounding the pattern).

kenorb
pconrey

11 Answers

You can use the command:

This will give you the line number, and will highlight non-ascii chars in red.

On some systems, depending on your settings, the above will not work, so you can grep for the inverse instead.

Note also that the important bit is the -P flag, which is short for --perl-regexp, so grep interprets your pattern as a Perl regular expression. The man page also says that

this is highly experimental and grep -P may warn of unimplemented features.
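
The command blocks for this answer did not survive in this copy. A later answer in the thread quotes the top answer's patterns as `[\x80-\xFF]` and its inverse `[^\x00-\x7F]`, so the following is a minimal sketch built on those two patterns, not the author's exact text. The sample file and the `LC_ALL=C` prefix are assumptions, added so the result does not depend on the current locale.

```shell
# Hypothetical sample file: only line 2 contains a non-ASCII character
# (é, encoded in UTF-8 as the bytes C3 A9, written here as octal escapes).
sample=$(mktemp)
printf 'plain ascii\ncaf\303\251\nmore ascii\n' > "$sample"

# Positive form: report lines containing any byte in 0x80-0xFF.
# -P enables Perl-compatible regexes, -n prints line numbers.
positive=$(LC_ALL=C grep -P -n '[\x80-\xFF]' "$sample" | cut -d: -f1)

# Inverse form: report lines containing any byte outside ASCII (0x00-0x7F).
inverse=$(LC_ALL=C grep -P -n '[^\x00-\x7F]' "$sample" | cut -d: -f1)

echo "positive form matched line: $positive"
echo "inverse form matched line:  $inverse"
rm -f "$sample"
```

When running either command interactively, adding --color='auto' highlights the matched bytes.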

jerrymouse

Instead of making assumptions about the byte range of non-ASCII characters, as most of the above solutions do, it's slightly better IMO to be explicit about the actual byte range of ASCII characters instead.

So the first solution for instance would become:

(which basically greps for any character outside the hexadecimal ASCII range, from \x00 up to \x7F)

On Mountain Lion that won't work (due to the lack of PCRE support in BSD grep), but with pcre installed via Homebrew, the following will work just as well:

Any pros or cons that anyone can think of?

pvandenberk

The following works for me:

Non-ASCII characters start at 0x80 and go to 0xFF when looking at bytes. Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching, as you seem to want. The -P option in my grep allows the use of \xdd escapes in character classes to accomplish what you want.
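
To make the byte-level view concrete, here is a small sketch (the sample word is an assumption, chosen only for illustration). It shows that a single accented character is stored as two bytes in UTF-8, and that a byte-oriented grep sees those bytes individually.

```shell
# é (U+00E9) is stored in UTF-8 as two bytes, written here as octal escapes.
bytes=$(printf '\303\251' | od -An -tx1 | tr -d ' \n')
echo "UTF-8 bytes of é: $bytes"

# With LC_ALL=C, grep -o -P prints every matching byte on its own line,
# so counting the lines counts non-ASCII bytes rather than characters.
nonascii_bytes=$(printf 'caf\303\251\n' | LC_ALL=C grep -o -P '[\x80-\xFF]' | wc -l | tr -d ' ')
echo "non-ASCII bytes in the word: $nonascii_bytes"
```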

Thelema
noquery

The easy way is to define a non-ASCII character... as a character that is not an ASCII character.

Add a tab after the ^ if necessary.

Setting LC_COLLATE=C avoids nasty surprises about the meaning of character ranges in many locales. Setting LC_CTYPE=C is necessary to match single-byte characters — otherwise the command would miss invalid byte sequences in the current encoding. Setting LC_ALL=C avoids locale-dependent effects altogether.
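
The command for this answer is missing from this copy. A pattern consistent with the description is `[^ -~]`, meaning "anything that is not between space and tilde"; the sketch below assumes that pattern and a hypothetical sample file, and shows what adding a tab after the ^ changes.

```shell
# Hypothetical sample: line 2 has an é (bytes C3 A9), line 3 starts with a tab.
sample=$(mktemp)
printf 'plain\ncaf\303\251\n\ttabbed\n' > "$sample"

# [^ -~] matches any byte that is not between space (0x20) and tilde (0x7E),
# which includes the tab on line 3 as well as the é on line 2.
without_tab=$(LC_ALL=C grep -n '[^ -~]' "$sample" | cut -d: -f1 | paste -sd, -)

# Putting a literal tab right after the ^ stops tab-indented lines from matching.
tab=$(printf '\t')
with_tab=$(LC_ALL=C grep -n "[^${tab} -~]" "$sample" | cut -d: -f1 | paste -sd, -)

echo "without tab in the class: lines $without_tab"
echo "with tab in the class:    lines $with_tab"
rm -f "$sample"
```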

Gilles

Here is another variant I found that produced completely different results from the grep search for [\x80-\xFF] in the accepted answer. Perhaps it will be useful to someone to find additional non-ascii characters:

grep --color='auto' -P -n '[^[:ascii:]]' myfile.txt

Note: my computer's grep (a Mac) did not have -P option, so I did brew install grep and started the call above with ggrep instead of grep.

ryanm

The following code works:

Replace /tmp with the name of the directory you want to search through.

bfontaine
user7417071

Searching for non-printable chars.

I agree with Harvey above, buried in the comments: it is often more useful to search for non-printable characters, and it is easy to think non-ASCII when you really should be thinking non-printable. Harvey suggests: 'use this: "[^\n -~]". Add \r for DOS text files. That translates to "[^\x0A\x20-\x7E]", and add \x0D for CR.'

Also, adding -c (show a count of matching lines) to grep is useful when searching for non-printable chars, as the strings matched can mess up the terminal.

I found that adding the ranges 0x00-0x08 and 0x0e-0x1f (to the 0x80-0xff range) gives a useful pattern. This excludes TAB, CR and LF and one or two more uncommon printable chars. So IMHO quite a useful (albeit crude) grep pattern is this one:

ACTUALLY, generally you will need to do this:

breakdown:

E.g. a practical example using find to grep all files under the current directory:

You may wish to adjust the grep at times. e.g. BS(0x08 - backspace) char used in some printable files or to exclude VT(0x0B - vertical tab). The BEL(0x07) and ESC(0x1B) chars can also be deemed printable in some cases.
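
The patterns for this answer were lost from this copy, so here is a minimal sketch assembled from the ranges named above (0x00-0x08, 0x0e-0x1f and 0x80-0xff) together with -c. The sample file is hypothetical and `LC_ALL=C` is added so the ranges are read as bytes.

```shell
sample=$(mktemp)
# Line 1 has a TAB (0x09, deliberately allowed), line 2 a BEL (0x07),
# line 3 an é (bytes C3 A9).
printf 'plain\ttext\nbell\007here\ncaf\303\251\n' > "$sample"

# -c prints the number of matching lines instead of the lines themselves,
# which keeps control characters out of the terminal.
flagged=$(LC_ALL=C grep -c -P '[\x00-\x08\x0E-\x1F\x80-\xFF]' "$sample")

echo "lines with control or non-ASCII bytes: $flagged"
rm -f "$sample"
```

To scan a directory tree, the same pattern can be run through find, for example `find . -type f -exec env LC_ALL=C grep -c -P '[\x00-\x08\x0E-\x1F\x80-\xFF]' {} +`, which prints one file:count pair per file.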

UPDATE: I had to revisit this recently. YMMV depending on terminal settings and the solar weather forecast, but I noticed that grep was not finding many Unicode or extended characters. Even though intuitively they should match the range 0x80 to 0xff, 3- and 4-byte Unicode characters were not matched. Can anyone explain this? Yes: @frabjous asked and @calandoa explained that LC_ALL=C should be used to set the locale for the command to make grep match.

e.g. my locale LC_ALL= empty

grep with LC_ALL= empty matches 2 byte encoded chars but not 3 and 4 byte encoded:

grep with LC_ALL=C does seem to match all extended characters that you would want:
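
The original terminal output is not preserved in this copy. As a hedged illustration of the LC_ALL=C case only, the sketch below uses a hypothetical file with one 2-byte, one 3-byte and one 4-byte UTF-8 character. As far as I understand it, in a UTF-8 locale grep -P reads `\x80-\xFF` as the code points U+0080 to U+00FF, which would explain why é matches there while characters with higher code points do not; with LC_ALL=C the same range is applied to raw bytes.

```shell
sample=$(mktemp)
# Line 1: é (2 bytes), line 2: € (3 bytes), line 3: 😀 (4 bytes), line 4: plain ASCII.
printf 'caf\303\251\n\342\202\254 5\n\360\237\230\200\nplain\n' > "$sample"

# Byte-oriented matching: every multi-byte character contains bytes >= 0x80.
matched=$(LC_ALL=C grep -n -P '[\x80-\xFF]' "$sample" | cut -d: -f1 | paste -sd, -)

echo "lines matched with LC_ALL=C: $matched"
rm -f "$sample"
```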

THIS perl match (partially found elsewhere on stackoverflow) OR the inverse grep on the top answer DO seem to find ALL the ~weird~ and ~wonderful~ 'non-ascii' characters without setting locale:

SO the preferred non-ascii char finders:

$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test

as in top answer, the inverse grep:

$ grep --color='auto' -P -n '[^\x00-\x7F]' notes_unicode_emoji_test

as in top answer but WITH LC_ALL=C:


$ LC_ALL=C grep --color='auto' -P -n '[\x80-\xFF]' notes_unicode_emoji_test

gaoithe

Strangely, I had to do this today! I ended up using Perl because I couldn't get grep/egrep to work (even in -P mode). Something like:

For unicode characters (like \u2212 in example below) use this:

dma_k
dty

It can be useful to know how to search for a single Unicode character. This command can help. You only need to know the character's code in UTF-8.
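
The command is not preserved in this copy. One way to do this, sketched under the assumption that the file is UTF-8 and using U+2212 (bytes E2 88 92) purely as an example character, is to build the byte sequence with printf and search for it as a fixed string.

```shell
sample=$(mktemp)
# Hypothetical sample: only line 2 contains U+2212 MINUS SIGN.
printf 'a - b\na \342\210\222 b\n' > "$sample"

# U+2212 is the byte sequence E2 88 92 in UTF-8; printf builds it from octal escapes.
needle=$(printf '\342\210\222')

# -F treats the needle as a literal string; LC_ALL=C compares raw bytes.
hits=$(LC_ALL=C grep -n -F "$needle" "$sample" | cut -d: -f1)

echo "U+2212 found on line: $hits"
rm -f "$sample"
```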

miken32
arezae

Finding all non-ascii characters gives the impression that one is either looking for unicode strings or intends to strip said characters individually.

For the former, try one of these (variable file is used for automation):

Vanilla grep doesn't work correctly without LC_ALL=C as noted in the previous answers.

The ASCII range is \x00-\x7F and space is \x20; since strings contain spaces, the negative range omits it.

The non-ASCII range is \x80-\xFF; since strings contain spaces, the positive range adds it.

A string is presumed to be at least 7 consecutive characters within the range: {7,}.

For shell readable output, uchardet $file returns a guess of the file encoding which is passed to iconv for automatic interpolation.
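
The commands for this answer are missing from this copy. The sketch below is a stand-in that uses grep, which may not be the tool the author used, to show the positive range with the space added and the {7,} repetition described above. The sample file is hypothetical, and uchardet and iconv are left out because their exact invocation is not recoverable here.

```shell
sample=$(mktemp)
# Line 1: a run of four é characters (8 non-ASCII bytes) between spaces.
# Line 2: a single é, too short to count as a string.
printf 'x \303\251\303\251\303\251\303\251 y\ncaf\303\251\n' > "$sample"

# -a treats the file as text, -o prints only the matched run.
# The class is the non-ASCII byte range plus space; {7,} requires 7 or more bytes.
runs=$(LC_ALL=C grep -a -o -P '[\x80-\xFF\x20]{7,}' "$sample" | wc -l | tr -d ' ')

echo "non-ASCII strings of 7+ bytes: $runs"
rm -f "$sample"
```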


noabody
