Author Topic: GCC4TI string manipulation functions  (Read 12004 times)


Offline TC01

  • LV6 Super Member (Next: 500)
  • ******
  • Posts: 344
  • Rating: +9/-0
    • View Profile
GCC4TI string manipulation functions
« on: July 19, 2010, 05:52:26 pm »
The first 68k C question in this subforum! (the flash apps one I posted was in the Other Calculator Help and Support forum)

Anyway, I'm having a little trouble understanding some of the string manipulation functions in GCC4TI (string.h). It's probably because of the way strings are treated in C (as arrays) which isn't really what I'm used to, being a programmer in higher level languages (Python and VB). Also because I haven't really used pointers before in programming.

Specifically, I want to read a text file (*.89t) and split it line by line, then do operations on each line. I've gotten my program to a point where it uses a dialog box to get the name of the text file, opens and reads it, and then prints the text to the screen (this last bit is just for debugging). It's the "splitting line-by-line" that's causing me problems.

I've realized that the text files use "\r" as newline characters (at least, I think they do), so I thought I'd use strcspn() to get the length of the string up to the newline, then use strcpy() to copy the line into another string and strpbrk() to create a substring without the first line.

The first problem is that if you type "\", GCC4TI ignores the closing quotation mark, assuming another character is coming after it. I've had to use "\r", but the problem with this is that if the line has an "r" in it, strcspn() would stop early (because it matches any character from the second string, not the second string as a whole). Is there any way to solve this? Or am I wrong and, in fact, it would only stop when it found both "\" and "r"?

Then, the second problem is that I'm unsure what exactly strpbrk() does. The docs say it "returns a pointer to the first occurrence of any of the characters in s2"- is this just a pointer to that one character, or a pointer to everything from that character to the end of the string?
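A rough sketch of how the two functions behave, written as portable C (on the calculator the include would presumably be tigcclib.h instead). The helper names are made up for illustration. The key point is that strpbrk() returns a pointer to that one character, but since a C string is just "a pointer to the first character, running until the terminating '\0'", the same pointer can also be read as everything from that character to the end of the string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the first line of `text`, i.e. the number
   of characters before the first '\r' (or before the end of the string). */
size_t first_line_length(const char *text)
{
    /* "\r" is a one-character string, so only a carriage return stops the
       scan; a plain 'r' in the text does not. */
    return strcspn(text, "\r");
}

/* Hypothetical helper: the rest of `text` starting AT the first '\r',
   or NULL if the text contains no '\r'. */
const char *from_first_newline(const char *text)
{
    return strpbrk(text, "\r");
}
```

Adding 1 to a non-NULL result of from_first_newline() steps over the '\r' itself, which gives the text without its first line and without copying anything.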



The userbars in my sig are links embedded links.

And in addition to calculator (and Python!) stuff, I mod Civilization 4 (frequently with Python).

Offline TC01

Re: GCC4TI string manipulation functions
« Reply #1 on: July 21, 2010, 11:10:35 pm »
Well, I've figured most of this out on my own.

Except for one thing: strcspn() is not working as expected. I have the code below (a separate project I created to test it):

Code: [Select]
// A test of the strcspn function, and how it works
void _main(void)
{
const char string[10] = "Abc\Defg";
unsigned long length = strcspn(string, "\\");
clrscr();
printf("%d", (int)length);
ngetchx();
}

"\\" is apparently equivalent to a single backslash- at least, it is when I print it out. Even if it isn't, it shouldn't matter what happens here because the documentation says strcspn should return the length of string 1 containing no characters from string 2. But evidently it does... the program prints out 7.

What's wrong here? How can I read a string up to a backslash? strcspn works fine if I have it run up to the "D" in the above test... can I fix this?




Offline TC01

Re: GCC4TI string manipulation functions
« Reply #2 on: July 22, 2010, 11:35:20 am »
Well... I've figured this out: the \ isn't picked up because a single backslash plus the next character (like \n or \r) forms a single character. The actual string would need to have a double backslash for it to work. Fortunately, since \r is a single character, I can look for it using strncpy() and strcspn() without problems.
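For reference, a minimal sketch of splitting on '\r' with strcspn() and strncpy(), in portable C. The function name next_line and its parameters are hypothetical, and it assumes the whole file is already in memory as one '\0'-terminated string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the first line of *cursor (up to, but not
   including, the next '\r') into `line`, which holds `size` bytes, and
   advances *cursor past that line and its '\r'.
   A line longer than size - 1 characters is truncated.
   Returns 1 if a line was read, 0 when no text is left. */
int next_line(const char **cursor, char *line, size_t size)
{
    size_t len, copy;

    if (size == 0 || **cursor == '\0')
        return 0;

    len = strcspn(*cursor, "\r");           /* characters before the '\r' */
    copy = (len < size - 1) ? len : size - 1;
    strncpy(line, *cursor, copy);
    line[copy] = '\0';                      /* strncpy does not add this here */

    *cursor += len;
    if (**cursor == '\r')
        (*cursor)++;                        /* step over the separator */
    return 1;
}
```

Calling it in a loop, `while (next_line(&p, buf, sizeof buf)) { ... }`, visits each line in turn, including empty ones.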

Don't know if I should even bother posting updates since there aren't really any 68k C programmers around... but I will anyway if I have more issues, just in case someone actually looks at and replies to this.




Offline Lionel Debroux

  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2135
  • Rating: +290/-45
    • View Profile
    • TI-Chess Team
Re: GCC4TI string manipulation functions
« Reply #3 on: July 22, 2010, 12:40:05 pm »
Sorry I missed your topic, I don't attend this section on a regular basis.
Indeed, '\r' is a single character, just like '\\'.
Member of the TI-Chess Team.
Co-maintainer of GCC4TI (GCC4TI online documentation), TILP and TIEmu.
Co-admin of TI-Planet.

Offline calcdude84se

  • Needs Motivation
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2272
  • Rating: +78/-13
  • Wondering where their free time went...
    • View Profile
Re: GCC4TI string manipulation functions
« Reply #4 on: July 22, 2010, 04:30:49 pm »
Yeah, things like '\r', '\\', '\"', '\t', and '\'' are all single characters, and are called "escape sequences." Without them you couldn't easily embed newlines, quotes, tabs, etc. in strings.
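A small sketch of what that means for the strcspn() test earlier in the thread, in portable C (the helper name is made up). In "Abc\Defg", \D is not a recognized escape sequence; GCC normally warns about it and keeps just the D, so the string contains no backslash at all, which matches the 7 that was printed. Writing the backslash as \\ puts a real one into the string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of characters in `s` before the first
   backslash (or the full length if `s` contains no backslash). */
size_t length_before_backslash(const char *s)
{
    /* "\\" in source code is a string holding ONE backslash character. */
    return strcspn(s, "\\");
}
```

With this, length_before_backslash("Abc\\Defg") stops after "Abc", while a string without any backslash is scanned to its end.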
« Last Edit: July 22, 2010, 04:31:24 pm by calcdude84se »
"People think computers will keep them from making mistakes. They're wrong. With computers you make mistakes faster."
-Adam Osborne
Spoiler For "PartesOS links":
I'll put it online when it does something.

Offline TC01

Re: GCC4TI string manipulation functions
« Reply #5 on: July 22, 2010, 09:57:39 pm »
Okay, thanks for that information- I knew about \n, of course... didn't realize there were other escape sequences (apart from \\).

It seems like the "split text file into lines" part of the program is working now, but if I have any other questions I'll post them.




Offline calcdude84se

Re: GCC4TI string manipulation functions
« Reply #6 on: July 22, 2010, 10:02:42 pm »
Nice to hear it's working! Yeah, just ask if you have more questions, we'll gladly answer :D

Offline TC01

Re: GCC4TI string manipulation functions
« Reply #7 on: July 24, 2010, 01:45:24 pm »
Unfortunately I'm having another issue- although not with the "read text files into lines" part, the "parse lines into 83+ tokens" part (see here for what my project is).

The way I'm removing characters from a string is strpbrk(). So my function to parse lines works recursively and does the following:

1. Send the line to a function to check if it's a token.
2. If it is a valid token, adjust the array of hexadecimal tokens and return.
3. If it is not, truncate it by removing a character from the end each time using strncpy() until it finds a token.
4. When a token is found, adjust the array of hexadecimal tokens, use strpbrk() to remove the part of the string that's a token (by getting the first character in the string that's not part of the token), and call the entire function again, this time for the rest of the string.

The trouble is if the "next character" in the string is the same as one of the characters that's part of the token, it won't work. So if the line is: Disp "HELLO", then the program will hang when it's truncated "HELLO" down to LLO", because two Ls are next to each other, and so strpbrk("LLO"", "L") will return "LLO"".

Is there some other way to split strings that doesn't have this problem? Can I slice them like "string[2:5]" and get back the string containing characters 3, 4, and 5 (or something like that)? If I knew a bit more about C I'm sure I could figure this out... but I don't.
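C has no built-in slice syntax, but a helper along these lines gets close to Python's string[start:end]. This is only a sketch in portable C; the name slice and its argument order are made up. Note that dropping characters from the front needs no copy at all: for a char pointer `line`, the expression `line + 2` already is the string without its first two characters, which does not care whether letters repeat.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src[start] .. src[end - 1] into `dest`
   (like Python's src[start:end]) and terminates it with '\0'.
   `dest` must hold at least end - start + 1 bytes.
   Indices past the end of `src` are clamped to its length. */
char *slice(char *dest, const char *src, size_t start, size_t end)
{
    size_t len = strlen(src);

    if (end > len)
        end = len;
    if (start > end)
        start = end;

    memcpy(dest, src + start, end - start);
    dest[end - start] = '\0';
    return dest;
}
```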




Offline calcdude84se

Re: GCC4TI string manipulation functions
« Reply #8 on: July 24, 2010, 02:39:15 pm »
Hm... First of all, you probably don't need to make the method recursive.
I'd suggest an approach of looping through the string with pointers rather than making it recursive and copying the string every time.
Secondly, you probably shouldn't be using strpbrk() to move to the next token, rather, using a looping method, you should just advance the pointer manually to the next token.
So, some C: (I'm assuming the existence of some external functions and some pre-existing variables and memory locations)
Code: [Select]
tokenInfo findToken(*char string); /*Given a string, this should find the longest token and return its length and hex*/
typedef struct {
    size_t chars; /* How many chars are in the string making up the token */
    BOOL twoByteToken; /*Whether or not the token is two bytes*/
    byte hex[2]; /*the two bytes making up the token, or only hex[0] if it's a one-byte token*/
} tokenInfo;
/*Working from these two...*/
void tokenize(*char string, *byte buffer) { /*Takes a string and a buffer to store generated hex. Make the buffer dynamic on your own time. (Or ask me ;D) */
    tokenInfo currentToken;
    while(*string != '\0') {
        currentToken = findToken(string);
        *(buffer++) = currentToken.hex[0];
        if(currentToken.twoByteToken)
            *(buffer++) = currentToken.hex[1];
        string += currentToken.chars;
    }
}
Would something like this work?
« Last Edit: July 24, 2010, 02:40:04 pm by calcdude84se »

Offline TC01

Re: GCC4TI string manipulation functions
« Reply #9 on: July 24, 2010, 04:49:10 pm »
That seems like it would work much better than what I was doing. Thanks!

Your assumptions are mostly correct- I hadn't been using a tokenInfo structure, but I eventually would have needed to create one. (I was using arrays for the handful of tokens I had defined, but that would have gotten unwieldy pretty quickly, especially with two-byte tokens).


Also, sort of unrelated: I'm wondering if it might be easier to define all the tokens (the strings and the hex) in a text file on the calculator, and read them in as tokenInfo structures (well, as a collection of tokenInfo structures defined in a tokens structure, or something like that). That way, I don't have to figure out how to handle all the weird characters, plus, the list of tokens could be modified without recompiling everything. The only downside is that the text file would be required on-calc.

To more experienced programmers- which seems like a better idea? Define them in the program or an external text file?




Offline calcdude84se

Re: GCC4TI string manipulation functions
« Reply #10 on: July 24, 2010, 05:26:02 pm »
The tokenInfo struct was intended to simply hold enough data for the tokenization routine.
If you wanted to store information in a text file, the structure would be more like an alpha-sorted sequence of strings with the corresponding information (one- or two-byte, the token hex).
Actually, if you chose the format appropriately, you could make it user-editable, like this: (this is only a suggestion, you can probably come up with a better format that does the same) (also, note that I don't know the hex for any tokens, so it's made up)
Code: [Select]
Disp
*45
Output(
*84
GarbageCollect
BB68
sin(
*97
OpenLib(
82d2
Note that you'd have to alpha-sort it at the beginning of the program (probably temp-only), but it would make it user-editable ;D
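A rough sketch of reading one hex line of that format, in portable C with no library calls. It assumes a leading '*' marks a one-byte token and that anything else is exactly four hex digits; the names TokenEntry, hex_value and parse_hex_line are made up for illustration.

```c
/* Hypothetical record for the hex half of one file entry. */
typedef struct {
    unsigned char hex[2];   /* token bytes; hex[1] is 0 for one-byte tokens */
    int two_byte;           /* nonzero if the token takes two bytes */
} TokenEntry;

/* Value of one hex digit, or -1 if `c` is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parses a line such as "*45" (one byte) or "BB68" (two bytes).
   Returns 1 on success, 0 if the line is malformed. */
int parse_hex_line(const char *line, TokenEntry *entry)
{
    int count, i;

    entry->two_byte = (line[0] != '*');
    if (!entry->two_byte)
        line++;                             /* skip the '*' marker */
    count = entry->two_byte ? 2 : 1;
    entry->hex[1] = 0;

    for (i = 0; i < count; i++) {
        int hi = hex_value(line[2 * i]);
        int lo;
        if (hi < 0)
            return 0;
        lo = hex_value(line[2 * i + 1]);
        if (lo < 0)
            return 0;
        entry->hex[i] = (unsigned char)(hi * 16 + lo);
    }
    return line[2 * count] == '\0';         /* reject trailing characters */
}
```

The token name on the preceding line can be kept as-is, so each entry is just "one name string plus one TokenEntry".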

Offline TC01

Re: GCC4TI string manipulation functions
« Reply #11 on: July 25, 2010, 02:22:26 pm »
Well, GCC4TI seems to take issue with some of that- it didn't like me defining things as bytes or BOOLs (or with asterisks before the variable type). I changed it, though (to use integers instead of bytes and bools).

But shouldn't the tokenInfo object also contain the name of the string? Or is there a way to compare the name of a variable against a string?

And if I did add it, would the size_t chars data even be necessary? Couldn't I just call strlen() on the string?




Offline Lionel Debroux

Re: GCC4TI string manipulation functions
« Reply #12 on: July 25, 2010, 03:09:03 pm »
"unsigned char" is 8 bits, "unsigned short" is 16 bits. "enum Bool", "BYTE" and "BOOL" are seldom used.

Quote
The trouble is if the "next character" in the string is the same as one of the characters that's part of the token, it won't work. So if the line is: Disp "HELLO", then the program will hang when it's truncated "HELLO" down to LLO", because two Ls are next to each other, and so strpbrk("LLO"", "L") will return "LLO"".
In many parsers, characters are divided into classes of lexical elements (string, comma, etc.), and a lexical analyzer is built to recognize them (e.g. a string starts with a '"' and it lasts until the next '"' or maybe end of line). On top of the lexical analyzer, there's a syntactical analyzer, which tries to parse a stream of lexical elements (get an element, then act upon its type - maybe there are other expected elements of various types after it). And on top of the syntactical analyzer, if the language is complicated enough, there's a semantics analyzer (e.g. {, number 234, comma, number 567, comma, number 8 is a list).
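To make the lexical-analyzer layer concrete, here is a very small sketch in portable C. The element classes (string, number, comma, everything else) and all the names are made up for illustration; a real TI-BASIC lexer would need more classes. A string starts at a '"' and lasts until the next '"', a '\r', or the end of the text, as described above.

```c
#include <stddef.h>

typedef enum { LEX_END, LEX_STRING, LEX_NUMBER, LEX_COMMA, LEX_OTHER } LexKind;

typedef struct {
    LexKind kind;
    const char *start;   /* first character of the element */
    size_t length;       /* number of characters it spans */
} Lexeme;

/* Hypothetical lexer step: reads one lexical element starting at `p`. */
Lexeme next_lexeme(const char *p)
{
    Lexeme lx;
    const char *q = p;

    lx.start = p;
    if (*p == '\0') {
        lx.kind = LEX_END;
    } else if (*p == '"') {
        lx.kind = LEX_STRING;
        q++;
        while (*q != '\0' && *q != '"' && *q != '\r')
            q++;
        if (*q == '"')
            q++;                    /* include the closing quote */
    } else if (*p >= '0' && *p <= '9') {
        lx.kind = LEX_NUMBER;
        while (*q >= '0' && *q <= '9')
            q++;
    } else if (*p == ',') {
        lx.kind = LEX_COMMA;
        q++;
    } else {
        lx.kind = LEX_OTHER;        /* any other single character */
        q++;
    }
    lx.length = (size_t)(q - p);
    return lx;
}
```

The next element always starts at `lx.start + lx.length`, so the caller just loops until it sees LEX_END. Because a whole "HELLO" comes back as one LEX_STRING element, repeated letters inside it never get compared against token names.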

But maybe I'm replying out of scope because I'm dense ? :)

Offline TC01

Re: GCC4TI string manipulation functions
« Reply #13 on: July 25, 2010, 03:24:12 pm »
"unsigned char" is 8 bits, "unsigned short" is 16 bits. "enum Bool", "BYTE" and "BOOL" are seldom used.

Okay, so really there's no reason to use BYTE or BOOL, then? Or is there some specific case where BYTE and BOOL are better to use?

Quote
Quote
The trouble is if the "next character" in the string is the same as one of the characters that's part of the token, it won't work. So if the line is: Disp "HELLO", then the program will hang when it's truncated "HELLO" down to LLO", because two Ls are next to each other, and so strpbrk("LLO"", "L") will return "LLO"".
In many parsers, characters are divided into classes of lexical elements (string, comma, etc.), and a lexical analyzer is built to recognize them (e.g. a string starts with a '"' and it lasts until the next '"' or maybe end of line). On top of the lexical analyzer, there's a syntactical analyzer, which tries to parse a stream of lexical elements (get an element, then act upon its type - maybe there are other expected elements of various types after it). And on top of the syntactical analyzer, if the language is complicated enough, there's a semantics analyzer (e.g. {, number 234, comma, number 567, comma, number 8 is a list).

But maybe I'm replying out of scope because I'm dense ? :)

Well, I understand what you're saying, but I'm not sure how it's relevant... Well, I know how it's relevant (since I'm writing a program to look through text files), but I'm not sure if any of the information is useful to my current problem.




Offline calcdude84se

Re: GCC4TI string manipulation functions
« Reply #14 on: July 25, 2010, 04:23:42 pm »
Unsigned char is a fine replacement for byte.
And as for the asterisks in the code, they should be between the variable type and variable name, my bad. (I haven't done C in a while, and I have never done C for the 68k calcs)
The reason I didn't have tokenInfo have the string was because it was unnecessary for the example. If you need it though, it can be added. Make sure you choose a long enough buffer for the chars, though, or that you can use dynamic memory allocation.
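Putting those corrections together, the loop might look roughly like this in portable C: asterisks between type and name, unsigned char instead of byte, int instead of BOOL, and the token text stored in the struct so that strlen() replaces a separate chars field. This is a sketch, not tested on a calculator. The table and its hex values are made up (they are not real 83+ tokens), and findToken here is a simple longest-match search that a real program would likely replace with something faster.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;       /* token text, e.g. "Disp " */
    unsigned char hex[2];   /* token bytes; hex[1] unused for one-byte tokens */
    int twoByteToken;       /* nonzero if the token takes two bytes */
} tokenInfo;

/* Made-up table for illustration only; the hex values are NOT real tokens. */
static const tokenInfo tokens[] = {
    { "Disp ",          { 0x45, 0x00 }, 0 },
    { "sin(",           { 0x97, 0x00 }, 0 },
    { "s",              { 0x73, 0x00 }, 0 },
    { "GarbageCollect", { 0xBB, 0x68 }, 1 }
};

/* Returns the longest token that `string` starts with, or NULL if none. */
const tokenInfo *findToken(const char *string)
{
    const tokenInfo *best = NULL;
    size_t bestLen = 0, i;

    for (i = 0; i < sizeof tokens / sizeof tokens[0]; i++) {
        size_t len = strlen(tokens[i].name);
        if (len > bestLen && strncmp(string, tokens[i].name, len) == 0) {
            best = &tokens[i];
            bestLen = len;
        }
    }
    return best;
}

/* Writes the token bytes for `string` into `buffer`, which must hold at
   least 2 * strlen(string) bytes. Returns the number of bytes written,
   or -1 if some part of the string matches no token. */
int tokenize(const char *string, unsigned char *buffer)
{
    unsigned char *out = buffer;

    while (*string != '\0') {
        const tokenInfo *t = findToken(string);
        if (t == NULL)
            return -1;
        *out++ = t->hex[0];
        if (t->twoByteToken)
            *out++ = t->hex[1];
        string += strlen(t->name);  /* advance past the matched token */
    }
    return (int)(out - buffer);
}
```

Because the pointer is advanced by the length of the matched token rather than by searching for a character, doubled letters such as "ss" are handled like any other input.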