[Chicken-users] Codepoint indices for matched regexps (UTF-8)?

Discussion:

Henry Hu

2018-06-15 13:44:14 UTC

Hello world!

I am trying to use unit irregex to match regular expressions in UTF-8
text. Is anyone familiar with a way to ask for the codepoint indices
rather than byte indices for the match?

For example:

(irregex-match-start-index (irregex-search (irregex "Ä" 'utf8) "ÄÄÄÄÄÄÄ"))

returns 6 when I want it to return 3, since there are 3 characters (6
bytes) before my match.

I tried (use utf8), but it is documented that it doesn't affect irregex and
it sure enough doesn't. I tried using the 'utf8 option while compiling my
regex, but it doesn't change the index returned by
irregex-match-start-index.

Thank you for any ideas you might have!

John Cowan

2018-06-15 23:00:25 UTC

On Fri, Jun 15, 2018 at 9:44 AM, Henry Hu <***@mit.edu> wrote:

Post by Henry Hu
I tried (use utf8), but it is documented that it doesn't affect irregex and
it sure enough doesn't. I tried using the 'utf8 option while compiling my
regex, but it doesn't change the index returned by
irregex-match-start-index.

Do "(use utf8)" and then "(import utf8-lolevel)" to get the (undocumented)
low-level utf8 API. The function utf8-offset->index accepts a string and a
byte offset and returns a codepoint index. If you want to go the other
way, utf8-index->offset is also provided.
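
A minimal sketch of how this might be wired up on CHICKEN 4, assuming the utf8 egg is installed and that utf8-offset->index takes the string followed by the byte offset, as described above. The helper name match-start-codepoint-index and the sample string are made up for illustration:

```scheme
;; CHICKEN 4 sketch; assumes the utf8 egg is installed.
(use irregex utf8)
(import utf8-lolevel)  ; undocumented low-level API mentioned above

;; Hypothetical helper (not part of irregex or utf8): convert the byte
;; offset reported by irregex into a codepoint index.
(define (match-start-codepoint-index m str)
  (utf8-offset->index str (irregex-match-start-index m)))

;; Illustrative input: "b" is preceded by the codepoints a and Ä.
(define str "aÄb")
(define m (irregex-search (irregex "b" 'utf8) str))

(print "byte offset:     " (irregex-match-start-index m))
(print "codepoint index: " (match-start-codepoint-index m str))
```

With this input the two printed numbers should differ, since Ä occupies two bytes in UTF-8.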

--
John Cowan http://vrici.lojban.org/~cowan ***@ccil.org
I don't know half of you half as well as I should like, and I like less
than half of you half as well as you deserve. --Bilbo

Martin Schneeweis

2018-06-19 16:23:28 UTC

Hi,

can anybody reproduce the following?

I have 3 very simple files A.scm, B.scm, C.scm - all located in the
same directory (source code see attachments)

C uses A and B.

I compile with the following statements:

csc -s A.scm
csc -s B.scm
csc C.scm

No compiler messages; "C" is executable and prints the expected values.

After the following changes in C.scm (C_1.scm):

(use A) => (use (prefix A a:))
(print (say-A)) => (print (a:say-A))

the compilation of C_1.scm prints the following:

Warning: extension `A' is currently not installed

but "C_1" is produced, executable and prints as expected.

- but -

After some more changes in C.scm (C_2.scm):

(use B) => (use (prefix B b:))
(print (say-B)) => (print (b:say-B))

the compilation of C_2.scm does not succeed and prints the following:

Warning: extension `A' is currently not installed

Error: shell command terminated with non-zero exit status 13:
'/usr/bin/chicken' 'C_2.scm' -output-file 'C_2.c'

Unfortunately I upgraded my system today (Arch Linux) and got a new gcc
compiler. I downgraded but the error persists. However I think my
system is the culprit because I am pretty sure I compiled more than 2
prefix-uses (of modules not installed yet) before.

(my chicken version should be the latest - "chicken -version":

(c) 2000-2007, Felix L. Winkelmann
Version 4.13.0 (rev 68eeaaef)
linux-unix-gnu-x86-64 [ 64bit manyargs dload ptables ]
compiled 2017-12-11 on yves.more-magic.net (Linux)
)

Martin

Martin Schneeweis

2018-06-19 19:36:58 UTC

Hi,

found the problem (nevertheless strange that 1 "prefix" does not make a

Post by Martin Schneeweis
csc -s A.scm
csc -s B.scm
csc C.scm

The solution is:

csc -s -J A.scm
csc -s A.import.scm

csc -s -J B.scm
csc -s B.import.scm

csc C_2.scm

Now there are 2 warnings

Warning: extension `A' is currently not installed
Warning: extension `B' is currently not installed

but the compile process succeeds.

Martin

Martin Schneeweis

2018-06-22 12:08:22 UTC

Hi,

is it possible to define recursive types?

Suppose I have a function

(define my-fun (lambda (lst) ...))

and I want to make sure that every odd element is a symbol and every
even element is a string.

The best I could come up with is something like this:

(define-type l-1 (pair symbol (pair string null)))
(define-type l-2 (pair symbol (pair string l-1)))
(define-type l-3 (pair symbol (pair string l-2)))
...
(define-type l (or ... l-3 l-2 l-1))

(: my-fun (l --> ...))
(define my-fun (lambda (lst) ...))

Another question: How could I define something like a "sequence"?

Suppose I don't want to pass the parameters in a list but rather do
something like this:

(define my-fun (lambda (sym-1 str-1 . rest) ...))

How could I define the type for "rest"?

If I change my definition for "l" (from above) to

(define-type l (or ... l-3 l-2 l-1 null))

I am almost there - except that the compiler expects me to pass the
"rest" actually as a list - but I want to do:

(print (my-fun 'a "a" 'b "b" ...))

Martin

Martin Schneeweis

2018-06-22 20:22:23 UTC

Hi,

it seems to me that the dbc egg does not work well when using the
compiler switch "-debug-info".

Without -debug-info my test executable produces the following output
(roughly what I had expected):

----
Contract violation in (add):
...
Error: exception-handler returned
...
dbc-supplier-test.scm:10: signal <--
----

- but -

When compiled *with* "-debug-info", the executable produces no output at
all.

(Maybe the problem has something to do with an exception handler that
returns from a non-continuable exception? I tried a simple example, but
it works (shows some output) both with and without "-debug-info".)

The problem is very simple to reproduce - put the attached files in a
directory and call

csc -sJ %dbc-supplier-test.scm
csc -sJ dbc-supplier-test.scm
csc dbc-client-test.scm
./dbc-client-test

=> expected output

csc -debug-info dbc-client-test.scm
./dbc-client-test

=> no output

I tried different things:
- also compiling the ...import.scm-files
- compiling *all* files with "-debug-info"
- removing the *.scm files before compiling the final executable

but nothing works (if "-debug-info" is used)

(By the way - I also don't understand why I need the "(use extras)" in
"dbc-client-test.scm" - but without that the executable complains
"Error: unbound variable: sort")

Martin

Attachments
- output.std-out.txt / output.std-err.txt
full output of the executable
- %dbc-supplier-test.scm
raw definition of "add"
- dbc-supplier-test.scm
contract definition for "add"
- dbc-client-test.scm
the "executable" that uses "add"

Martin Schneeweis

2018-06-23 00:27:06 UTC

Hi,

Post by Martin Schneeweis
(define my-fun (lambda (lst) ...))
and I want to make sure that every odd element is a symbol and every
even element is a string.

the best I could come up with is

(: my-fun ((list-of (pair symbol string)) --> string))
(define my-fun (lambda (lst) ...))

Unfortunately, this is a little unwieldy on the client side:

(my-fun '((a . "a") (b . "b") ...))
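
One possible way to keep the flat call style at the call site is a thin wrapper that pairs up its arguments before calling my-fun. This is only a sketch: flat->pairs and my-fun* are hypothetical names, the body of my-fun here is a stand-in (the real one is elided above), and the wrapper only checks at run time that the arguments come in pairs; it does no static checking of the individual elements.

```scheme
;; Stand-in body for my-fun; the real body is not shown in this thread.
(: my-fun ((list-of (pair symbol string)) --> string))
(define my-fun
  (lambda (lst)
    (apply string-append (map cdr lst))))

;; Hypothetical helper: turn (a "a" b "b") into ((a . "a") (b . "b")).
(define (flat->pairs args)
  (cond ((null? args) '())
        ((null? (cdr args))
         (error "odd number of arguments" args))
        (else (cons (cons (car args) (cadr args))
                    (flat->pairs (cddr args))))))

;; Hypothetical wrapper accepting the flat argument list.
(define (my-fun* . args)
  (my-fun (flat->pairs args)))

(print (my-fun* 'a "a" 'b "b"))
```

The trade-off is that the type declaration still describes only the list-of-pairs form, so a call such as (my-fun* 'a 'b) would not be flagged by the type checker.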

Martin

Martin Schneeweis

2018-06-25 11:14:11 UTC

Hi,

is there a way to change the following page:

http://wiki.call-cc.org/chicken-projects/egg-index-4.html

I assume this page is produced programmatically (no edit-link on top of
the page)

The egg "environments" belongs to the category "Unsupported or
redundant" (at least I assume so because
http://bugs.call-cc.org/ticket/643 was merged into the master branch 7
years ago)

Martin

Evan Hanson

2018-08-23 21:48:28 UTC

Hi Martin,

Post by Martin Schneeweis
http://wiki.call-cc.org/chicken-projects/egg-index-4.html
I assume this page is produced programmatically (no edit-link on top of
the page)

It is, yeah. The info comes from the eggs themselves.

Post by Martin Schneeweis
The egg "environments" belongs to the category "Unsupported or
redundant" (at least I assume so because
http://bugs.call-cc.org/ticket/643 was merged into the master branch 7
years ago)

That's right, and it was marked as obsolete a while back, but the
versioning of the egg was such that the new category wasn't picked up.
It should be fixed now, or rather the next time the page is generated.

Thanks for pointing that out!

Evan