Henry Hu
2018-07-03 21:40:13 UTC
Hello,
I am getting a strange result from unit irregex having to do with matching
character sets.
I recently upgraded to 4.13.0 to get the bug fix having to do with an extra
empty list in the SRE: https://github.com/ashinn/irregex/pull/18. I was
happy to find that "[]" bracketed character sets without "^" are working
beautifully! I am, however, observing strange things with the "^"
exclusion character.
The ⟠ character is three bytes long and, when displayed in byte form, looks like
`\342\276\200`:
INPUT:
(use irregex) ; Not doing (use utf8) because I want start-index and
              ; end-index to function correctly
(irregex-match-substring (irregex-search (irregex "[^⟠]" 'utf8) "⟠⟠⟠"))
EXPECTED OUTPUT:
Considering a UTF-8 character as a single character anywhere it appears:
`#f`
Considering a UTF-8 character as a single character sometimes and a byte
string sometimes: `<the first byte of ⟠>` (displayed as `\342`), or #f
Considering a UTF-8 character as a byte string always: #f
OUTPUT:
`<the first byte of ⟠><the second byte of ⟠>` (looks like `\342\276`)
EVEN WORSE:
(irregex-match-substring (irregex-search (irregex "[^Ã]" 'utf8) "Ã")) --->
"Ã" ; A two-byte character
Am I doing something wrong? Is "^" not designed to be used with multibyte
characters? Why would it return two bytes and not 0, 1, or 3?
Thank you!