Omnimaga

Calculator Community => TI Calculators => Calculator C => Topic started by: TC01 on July 19, 2010, 05:52:26 pm

Title: GCC4TI string manipulation functions
Post by: TC01 on July 19, 2010, 05:52:26 pm
The first 68k C question in this subforum! (the flash apps one I posted was in the Other Calculator Help and Support forum)

Anyway, I'm having a little trouble understanding some of the string manipulation functions in GCC4TI (string.h). It's probably because of the way strings are treated in C (as arrays) which isn't really what I'm used to, being a programmer in higher level languages (Python and VB). Also because I haven't really used pointers before in programming.

Specifically, I want to read a text file (*.89t) and split it line by line, then do operations on each line. I've gotten my program to a point where it uses a dialog box to get the name of the text file, opens and reads it, and then prints the text to the screen (this last bit is just for debugging). It's the "splitting line-by-line" that's causing me problems.

I've realized that the text files use "\r" as newline characters (at least, I think they do) and so I thought to use strcspn() to get the length of the string up until the newline, then use strcpy() to copy the line into another string and strpbrk() to create a substring without the first line.

The first problem is that if you type "\", GCC4TI ignores the closing quotation mark, assuming another character is coming after it. I've had to use "\r"- but the problem with this is that if the line has an "r" in it, strcspn() would stop early (because it matches any character from the second string, not the second string as a whole). Is there any way to solve this? Or am I wrong and, in fact, it would only stop when it found both "\" and "r"?

Then, the second problem is that I'm unsure what exactly strpbrk() does. The docs say it "returns a pointer to the first occurrence of any of the characters in s2"- is this just a pointer to that one character, or a pointer to everything from that character to the end of the string?
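
For reference, a minimal sketch of how these two functions behave. It is written against the standard C <string.h>, on the assumption that GCC4TI's versions follow the same semantics. The helper names first_line_length and rest_after_line are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Length of the first line: the number of characters before the first
   carriage return, or the whole length if there is none. The set "\r"
   holds exactly one character, so a plain letter 'r' does not match. */
size_t first_line_length(const char *text)
{
    return strcspn(text, "\r");
}

/* Pointer to the text following the first carriage return, or NULL if
   there is none. strpbrk() returns a pointer INTO the original string;
   read as a string, it runs from the matched character to the end. */
const char *rest_after_line(const char *text)
{
    const char *hit = strpbrk(text, "\r");
    return hit ? hit + 1 : NULL;
}
```

So a pointer to "that one character" and a pointer to "everything from that character to the end" are the same thing in C: nothing is copied, and the result stays valid only as long as the original string does.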
Title: Re: GCC4TI string manipulation functions
Post by: TC01 on July 21, 2010, 11:10:35 pm
Well, I've figured most of this out on my own.

Except for one thing- strcspn is not working as expected. I have the code below (separate project I created to test it):

Code: [Select]
// A test of the strcspn function, and how it works
void _main(void)
{
    const char string[10] = "Abc\Defg";
    unsigned long length = strcspn(string, "\\");
    clrscr();
    printf("%d", (int)length);
    ngetchx();
}

"\\" is apparently equivalent to a single backslash- at least, it is when I print it out. Even if it isn't, it shouldn't matter what happens here because the documentation says strcspn should return the length of string 1 containing no characters from string 2. But evidently it does... the program prints out 7.

What's wrong here? How can I read a string up to a backslash? strcspn works fine if I have it run up to the "D" in the above test... can I fix this?
Title: Re: GCC4TI string manipulation functions
Post by: TC01 on July 22, 2010, 11:35:20 am
Well... I've figured this out- \ isn't picked up because a single backslash plus the next character (like \n or \r) is read as one separate character. To contain an actual backslash, the string would need a double backslash. Fortunately, since \r is a single character, I can look for it using strncpy() and strcspn() without problems.

Don't know if I should even bother posting updates since there aren't really any 68k C programmers around... but I will anyway if I have more issues, just in case someone actually looks at and replies to this.
Title: Re: GCC4TI string manipulation functions
Post by: Lionel Debroux on July 22, 2010, 12:40:05 pm
Sorry I missed your topic, I don't attend this section on a regular basis.
Indeed, '\r' is a single character, just like '\\'.
Title: Re: GCC4TI string manipulation functions
Post by: calcdude84se on July 22, 2010, 04:30:49 pm
Yeah, things like '\r', '\\', '\"', '\t', and '\'' are all single characters, and are called "escape sequences." Without them you couldn't easily embed newlines, quotes, tabs, etc. in strings.
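
A small sketch of the point about escape sequences, in standard C (GCC4TI's strcspn() is assumed to behave the same way). The helper name length_before_backslash is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Number of characters in s before the first backslash
   (the full length of s if it contains no backslash). */
size_t length_before_backslash(const char *s)
{
    /* The literal "\\" is a one-character string: a single backslash. */
    return strcspn(s, "\\");
}
```

With the backslash written as "Abc\\Defg" in the source, this stops after "Abc". In the earlier test, "\D" is not a defined escape sequence; GCC most likely warned about it and kept only the D, which would leave the 7-character string "AbcDefg" with no backslash in it at all.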
Title: Re: GCC4TI string manipulation functions
Post by: TC01 on July 22, 2010, 09:57:39 pm
Okay, thanks for that information- I knew about \n, of course... didn't realize there were other escape sequences (apart from \\).

It seems like the "split text file into lines" part of the program is working now, but if I have any other questions I'll post them.
Title: Re: GCC4TI string manipulation functions
Post by: calcdude84se on July 22, 2010, 10:02:42 pm
Nice to hear it's working! Yeah, just ask if you have more questions, we'll gladly answer :D
Title: Re: GCC4TI string manipulation functions
Post by: TC01 on July 24, 2010, 01:45:24 pm
Unfortunately I'm having another issue- not with the "read text files into lines" part, but with the "parse lines into 83+ tokens" part (see here (http://ourl.ca/6442) for what my project is).

The way I'm removing characters from a string is with strpbrk(). My function to parse lines works recursively and does the following:

1. Send the line to a function to check if it's a token.
2. If it is a valid token, adjust the array of hexadecimal tokens and return.
3. If it is not, truncate it by removing one character from the end at a time using strncpy() until it finds a token.
4. When a token is found, adjust the array of hexadecimal tokens, use strpbrk() to remove the part of the string that's a token (by getting the first character in the string that's not part of the token), and call the entire function again, this time for the rest of the string.

The trouble is if the "next character" in the string is the same as one of the characters that's part of the token, it won't work. So if the line is: Disp "HELLO", then the program will hang when it's truncated "HELLO" down to LLO", because two Ls are next to each other, and so strpbrk("LLO"", "L") will return "LLO"".

Is there some other way to split strings that doesn't have this problem? Can I slice them like "string[2:5]" and get back the string containing characters 3, 4, and 5 (or something like that)? If I knew a bit more about C I'm sure I could figure this out... but I don't.
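
C has no slice syntax, but the same effect can be had by copying a range of characters. A minimal sketch in standard C; the function name slice and its half-open [start, end) convention are assumptions made for this example, not anything GCC4TI provides.

```c
#include <stddef.h>
#include <string.h>

/* Copy characters start..end-1 of src into dst and NUL-terminate it.
   dst must have room for (end - start) + 1 bytes. Indices past the end
   of src are clamped, so an out-of-range slice gives a shorter string. */
char *slice(char *dst, const char *src, size_t start, size_t end)
{
    size_t len = strlen(src);
    if (end > len)
        end = len;
    if (start > end)
        start = end;
    memcpy(dst, src + start, end - start);
    dst[end - start] = '\0';
    return dst;
}
```

When only the front of the string has to go, no copy is needed at all: string + n is already the string without its first n characters, no matter which characters come next.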
Title: Re: GCC4TI string manipulation functions
Post by: calcdude84se on July 24, 2010, 02:39:15 pm
Hm... First of all, you probably don't need to make the method recursive.
I'd suggest an approach of looping through the string with pointers rather than making it recursive and copying the string every time.
Secondly, you probably shouldn't be using strpbrk() to move to the next token. Instead, with a looping method, you can just advance the pointer manually to the next token.
So, some C: (I'm assuming the existence of some external functions and some pre-existing variables and memory locations)
Code: [Select]
tokenInfo findToken(*char string); /*Given a string, this should find the longest token and return its length and hex*/
typedef struct {
    size_t chars; /* How many chars are in the string making up the token */
    BOOL twoByteToken; /*Whether or not the token is two bytes*/
    byte hex[2]; /*the two bytes making up the token, or only hex[0] if it's a one-byte token*/
} tokenInfo;
/*Working from these two...*/
void tokenize(*char string, *byte buffer) { /*Takes a string and a buffer to store generated hex. Make the buffer dynamic on your own time. (Or ask me ;D) */
    tokenInfo currentToken;
    while(*string != '\0') {
        currentToken = findToken(string);
        *(buffer++) = currentToken.hex[0];
        if(currentToken.twoByteToken)
            *(buffer++) = currentToken.hex[1];
        string += currentToken.chars;
    }
}
Would something like this work?
Title: Re: GCC4TI string manipulation functions
Post by: TC01 on July 24, 2010, 04:49:10 pm
That seems like it would work much better than what I was doing. Thanks!

Your assumptions are mostly correct- I hadn't been using a tokenInfo structure, but I eventually would have needed to create one. (I was using arrays for the handful of tokens I had defined, but that would have gotten unwieldy pretty quickly, especially with two-byte tokens).


Also, sort of unrelated: I'm wondering if it might be easier to define all the tokens (the strings and the hex) in a text file on the calculator, and read them in as tokenInfo structures (well, as a collection of tokenInfo structures defined in a tokens structure, or something like that). That way, I don't have to figure out how to handle all the weird characters, plus, the list of tokens could be modified without recompiling everything. The only downside is that the text file would be required on-calc.

To more experienced programmers- which seems like a better idea? Define them in the program or an external text file?
Title: Re: GCC4TI string manipulation functions
Post by: calcdude84se on July 24, 2010, 05:26:02 pm
The tokenInfo struct was intended to simply hold enough data for the tokenization routine.
If you wanted to store information in a text file, the structure would be more like an alpha-sorted sequence of strings with the corresponding information (one- or two-byte, the token hex).
Actually, if you chose the format appropriately, you could make it user-editable, like this: (this is only a suggestion, you can probably come up with a better format that does the same) (also, note that I don't know the hex for any tokens, so it's made up)
Code: [Select]
Disp
*45
Output(
*84
GarbageCollect
BB68
sin(
*97
OpenLib(
82d2
Note that you'd have to alpha-sort it at the beginning of the program (probably temp-only), but it would make it user-editable ;D
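
A rough sketch of how the hex line of one entry in such a file might be read, in standard C. It assumes the suggested format exactly as shown (a leading * marks a one-byte token, four hex digits mark a two-byte token); parse_hex_line and hex_digit are hypothetical names.

```c
#include <ctype.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (char)toupper((unsigned char)c);
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* Parse the hex line of one entry: "*45" is the one-byte token 0x45,
   "BB68" is the two-byte token 0xBB 0x68. Stores the bytes in out[]
   and returns how many were stored (1 or 2), or 0 if the line is
   malformed. */
int parse_hex_line(const char *line, unsigned char out[2])
{
    int count = 2, i;

    if (line[0] == '*') {
        count = 1;
        line++;
    }
    for (i = 0; i < count; i++) {
        int hi = hex_digit(line[2 * i]);
        /* Only look at the second digit if the first one was valid,
           so we never read past the end of the string. */
        int lo = (hi < 0) ? -1 : hex_digit(line[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return 0;
        out[i] = (unsigned char)(hi * 16 + lo);
    }
    /* Only the end of the line may follow the digits. */
    return (line[2 * count] == '\0' || line[2 * count] == '\r') ? count : 0;
}
```

The return value doubles as the "one- or two-byte" flag, so the file does not need a separate field for it.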
Title: Re: GCC4TI string manipulation functions
Post by: TC01 on July 25, 2010, 02:22:26 pm
Well, GCC4TI seems to take issue with some of that- it didn't like me defining things as bytes or BOOLs (or with asterisks before the variable type). I changed it, though (to use integers instead of bytes and bools).

But shouldn't the tokenInfo object also contain the name of the string? Or is there a way to compare the name of a variable against a string?

And if I did add it, would the size_t chars data even be necessary? Couldn't I just call strlen() on the string?
Title: Re: GCC4TI string manipulation functions
Post by: Lionel Debroux on July 25, 2010, 03:09:03 pm
"unsigned char" is 8 bits, "unsigned short" is 16 bits. "enum Bool", "BYTE" and "BOOL" are seldom used.

Quote
The trouble is if the "next character" in the string is the same as one of the characters that's part of the token, it won't work. So if the line is: Disp "HELLO", then the program will hang when it's truncated "HELLO" down to LLO", because two Ls are next to each other, and so strpbrk("LLO"", "L") will return "LLO"".
In many parsers, characters are divided into classes of lexical elements (string, comma, etc.), and a lexical analyzer is built to recognize them (e.g. a string starts with a '"' and it lasts until the next '"' or maybe end of line). On top of the lexical analyzer, there's a syntactical analyzer, which tries to parse a stream of lexical elements (get an element, then act upon its type - maybe there are other expected elements of various types after it). And on top of the syntactical analyzer, if the language is complicated enough, there's a semantics analyzer (e.g. {, number 234, comma, number 567, comma, number 8 is a list).

But maybe I'm replying out of scope because I'm dense ? :)
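
To make the lexical-analyzer idea concrete, here is a very small sketch in standard C. The element classes, the LexType enum and the function next_lexeme are all invented for this example; a real TI-BASIC lexer would need more classes.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { LEX_END, LEX_STRING, LEX_NUMBER, LEX_COMMA, LEX_OTHER } LexType;

/* Classify the lexical element at the start of s and store its length
   in *len, so the caller can advance with s += *len. */
LexType next_lexeme(const char *s, size_t *len)
{
    size_t n = 0;

    if (s[0] == '\0') {
        *len = 0;
        return LEX_END;
    }
    if (s[0] == '"') {
        /* A string lasts until the next quote, or until end of line. */
        n = 1;
        while (s[n] != '\0' && s[n] != '\r' && s[n] != '"')
            n++;
        if (s[n] == '"')
            n++;                /* include the closing quote */
        *len = n;
        return LEX_STRING;
    }
    if (isdigit((unsigned char)s[0])) {
        while (isdigit((unsigned char)s[n]))
            n++;
        *len = n;
        return LEX_NUMBER;
    }
    *len = 1;
    return (s[0] == ',') ? LEX_COMMA : LEX_OTHER;
}
```

The relevance to the "HELLO" problem is that a quoted string is consumed as one element with a known length, so repeated letters inside it never need to be searched for.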
Title: Re: GCC4TI string manipulation functions
Post by: TC01 on July 25, 2010, 03:24:12 pm
Quote
"unsigned char" is 8 bits, "unsigned short" is 16 bits. "enum Bool", "BYTE" and "BOOL" are seldom used.

Okay, so really there's no reason to use BYTE or BOOL, then? Or is there some specific case where BYTE and BOOL are better to use?

Quote
Quote
The trouble is if the "next character" in the string is the same as one of the characters that's part of the token, it won't work. So if the line is: Disp "HELLO", then the program will hang when it's truncated "HELLO" down to LLO", because two Ls are next to each other, and so strpbrk("LLO"", "L") will return "LLO"".
In many parsers, characters are divided into classes of lexical elements (string, comma, etc.), and a lexical analyzer is built to recognize them (e.g. a string starts with a '"' and it lasts until the next '"' or maybe end of line). On top of the lexical analyzer, there's a syntactical analyzer, which tries to parse a stream of lexical elements (get an element, then act upon its type - maybe there are other expected elements of various types after it). And on top of the syntactical analyzer, if the language is complicated enough, there's a semantics analyzer (e.g. {, number 234, comma, number 567, comma, number 8 is a list).

But maybe I'm replying out of scope because I'm dense ? :)

Well, I understand what you're saying, but I'm not sure how it's relevant... Well, I know how it's relevant (since I'm writing a program to look through text files), but I'm not sure if any of the information is useful to my current problem.
Title: Re: GCC4TI string manipulation functions
Post by: calcdude84se on July 25, 2010, 04:23:42 pm
Unsigned char is a fine replacement for byte.
And as for the asterisks in the code, they should be between the variable type and variable name, my bad. (I haven't done C in a while, and I have never done C for the 68k calcs)
The reason I didn't have tokenInfo have the string was because it was unnecessary for the example. If you need it though, it can be added. Make sure you choose a long enough buffer for the chars, though, or that you can use dynamic memory allocation.
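
Putting those corrections together, a compilable sketch of the earlier loop in standard C, with unsigned char in place of byte, int in place of BOOL, and the token text stored in the struct. The table contents and hex values are made up, and the TOKEN macro, the size parameter and the -1 error return are additions for this example rather than part of the original suggestion.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;       /* text of the token */
    size_t chars;           /* length of name, cached */
    int twoByteToken;       /* nonzero if the token is two bytes */
    unsigned char hex[2];   /* token bytes; only hex[0] if one-byte */
} tokenInfo;

/* sizeof(text) - 1 is the length of a string literal without its NUL. */
#define TOKEN(text, two, b0, b1) { text, sizeof(text) - 1, two, { b0, b1 } }

/* Illustrative table only: these are NOT real 83+ token values. */
static const tokenInfo tokens[] = {
    TOKEN("s", 0, 0x73, 0x00),
    TOKEN("sin(", 0, 0x97, 0x00),
    TOKEN("Disp ", 0, 0x45, 0x00),
    TOKEN("GarbageCollect", 1, 0xBB, 0x68),
};

/* Longest token that string starts with, or NULL if none matches. */
const tokenInfo *findToken(const char *string)
{
    const tokenInfo *best = NULL;
    size_t i;

    for (i = 0; i < sizeof tokens / sizeof tokens[0]; i++) {
        if (strncmp(string, tokens[i].name, tokens[i].chars) == 0 &&
            (best == NULL || tokens[i].chars > best->chars))
            best = &tokens[i];
    }
    return best;
}

/* Write the token bytes for string into buffer (at most size bytes).
   Returns the number of bytes written, or -1 on an unknown token or a
   full buffer. */
int tokenize(const char *string, unsigned char *buffer, size_t size)
{
    size_t used = 0;

    while (*string != '\0') {
        const tokenInfo *t = findToken(string);
        size_t need = (t != NULL && t->twoByteToken) ? 2 : 1;
        if (t == NULL || used + need > size)
            return -1;
        buffer[used++] = t->hex[0];
        if (t->twoByteToken)
            buffer[used++] = t->hex[1];
        string += t->chars;     /* skip past the token just matched */
    }
    return (int)used;
}
```

As for chars versus strlen(): calling strlen() on the name each time would give the same number, so the field is only a cache that saves recomputing it on every comparison.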
Title: Re: GCC4TI string manipulation functions
Post by: TC01 on July 25, 2010, 08:04:12 pm
Quote
Unsigned char is a fine replacement for byte.

Does it really make a difference if I use unsigned char or unsigned int?

I mean, I know that unsigned char is the same size as a byte. It's just I already have the code written that treats the output as an array of integers.
Title: Re: GCC4TI string manipulation functions
Post by: calcdude84se on July 25, 2010, 08:13:25 pm
An array of integers? Don't you have to treat it as a byte (or unsigned char) array for it to have the same format as on the 83+ Series?
I mean, it could work, but that would make the files larger and the PC converter more complicated.
I wish you good luck with whatever avenue you take, though :)
Title: Re: GCC4TI string manipulation functions
Post by: TC01 on July 25, 2010, 11:05:28 pm
...and now I have a memory leak. I know where it's coming from but not how to fix it.

The tokenizePrgm() function (which is the core of my tokenizer code- it's essentially a main function called by the real main function in another file) creates a pointer to an array of NUM_TOKEN tokenInfo objects.

This array is passed to an initTokens() procedure, which builds tokenInfo objects based on data stored in three arrays (this will change, obviously, when I implement the token file). I dynamically allocate memory to the names of each of those tokens.

The tokens array is then passed to all functions that need to use it, and at the end, in tokenizePrgm() I free up the tokens array.

Unfortunately, I haven't freed up the memory I allocated to the tokenInfo.name part of each token. And if I try to do it by looping over each token and calling:

Code: [Select]
free(tokens[i].name);
I get an address error.

Is there any way I can free this memory, or is the only solution not to use dynamic memory allocation here?
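
Without the full source it is hard to say what triggers the address error, but common causes are freeing the array before the names, calling free() on a name that was never returned by malloc() (a string literal or an uninitialized pointer, for instance), or allocating one byte too few for the terminating NUL. A minimal sketch of a pattern that avoids all three, in standard C; the struct layout and the function names allocTokens, setTokenName and freeTokens are assumptions, not the actual project code.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *name;             /* heap-allocated copy of the token text */
    unsigned char hex[2];
} tokenInfo;

/* Allocate count tokens with every name set to NULL, so freeTokens()
   is safe even if initialization stops part-way through. */
tokenInfo *allocTokens(size_t count)
{
    tokenInfo *tokens = malloc(count * sizeof *tokens);
    size_t i;

    if (tokens != NULL)
        for (i = 0; i < count; i++)
            tokens[i].name = NULL;
    return tokens;
}

/* Give token its own heap copy of name. Returns 0 on success, -1 if
   the allocation fails (the old name is then left untouched). */
int setTokenName(tokenInfo *token, const char *name)
{
    char *copy = malloc(strlen(name) + 1);  /* +1 for the NUL */

    if (copy == NULL)
        return -1;
    strcpy(copy, name);
    free(token->name);      /* free(NULL) does nothing */
    token->name = copy;
    return 0;
}

/* Free every name first, then the array that holds the pointers. */
void freeTokens(tokenInfo *tokens, size_t count)
{
    size_t i;

    if (tokens == NULL)
        return;
    for (i = 0; i < count; i++)
        free(tokens[i].name);
    free(tokens);
}
```

The order in freeTokens() matters: once the array itself has been freed, tokens[i].name can no longer be read safely.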
Title: Re: GCC4TI string manipulation functions
Post by: Lionel Debroux on July 26, 2010, 01:36:52 am
Quote
Does it really make a difference if I use unsigned char or unsigned int?
Size and potentially padding: unsigned int is, with the default compiler settings, 16 bits, while unsigned char is 8 bits.
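
If the existing code already produces an array of integers, one option is to keep it and narrow the values to bytes in a final pass. A minimal sketch in standard C; uint16_t stands in here for the calculator's 16-bit unsigned int (an assumption made so the example behaves the same on a PC), and pack_tokens is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n token values held in 16-bit integers into a byte buffer,
   keeping only the low 8 bits of each value. out must have room for
   n bytes. Returns the number of bytes written. */
size_t pack_tokens(const uint16_t *in, size_t n, uint8_t *out)
{
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = (uint8_t)(in[i] & 0xFF);
    return n;
}
```

The byte buffer is half the size of the integer array it was built from, and it is the byte-per-element layout discussed above for the 83+ format.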