Every good compiler needs a few basic parts:
-A lexer: this converts plain-text source files into a stream of tokens, each tagged with a type that the language uses (string literal, type identifier, list, etc.). There are plenty of dedicated lexer generators out there that you can easily use for a new language. flex is a particularly popular one.
-A parser: this takes your now-tokenised source, checks it against the grammar of the language, and rejects illegal statements (something like int "tasty"; for example, if you're writing C). There is a lot of interesting theory behind the best methods of doing this, but again, you'd be best off just using one of the many pre-made parser generators out there. bison is a popular one.
-A method for organising data elements (a symbol table) and nested, recursive statements (a syntax tree): this step will be as easy or as difficult as the language you're implementing is complex. If you're making a simple, purely functional language, you can use whatever structure you want to store your symbol table. Even a giant list would do, though you'd probably want something like a hash table in order to reduce lookup times. The syntax tree does pretty much have to be a tree structure, because it breaks the code up into a hierarchy so that everything afterwards knows the correct order in which to read your source code. If you had the following code, for example:
Code:
    For(A,0,2)
    A→C
    If A=B
    <do foo>
    Else
    <do bar>
    End
    End
The syntax tree for it would look something like this:
Code:
                   [<rest of program above>]
                               |
                          [For loop]
                      /        |        \
          [initialiser]     [body]     [increment]
               /               |               \
           [0→A]        [If statement]        [A++]
                          /         \
                 [condition]       [body]
                      |            /    \
                    [A≤2]     [A→C]      \
                                 [two part If statement]
                                  /        |        \
                       [condition]   [then part]   [else part]
                            |             |             |
                          [A=B]      [<do foo>]    [<do bar>]
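To make the lexer, parser, symbol table and syntax tree steps a bit more concrete, here is a minimal Python sketch that handles just enough syntax for the For/If example above. It is hand-written rather than generated with flex or bison, and everything in it is an assumption made for illustration: the token names, the Parser class, the tuple shapes used for tree nodes, and the choice of a plain dict for the symbol table. It also keeps For as a single node instead of splitting it into initialiser, body and increment.

```python
import re

# Token kinds and patterns. These names are invented for this sketch.
TOKEN_SPEC = [
    ("KEYWORD", r"For|If|Else|End"),
    ("NUMBER",  r"\d+"),
    ("NAME",    r"[A-Z]"),
    ("STUB",    r"<[^>]*>"),      # placeholders such as <do foo>
    ("SYMBOL",  r"[(),=→]"),
    ("NEWLINE", r"\n"),
    ("SKIP",    r"[ \t]+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{k}>{p})" for k, p in TOKEN_SPEC))


def lex(source):
    """Lexer: turn source text into a list of (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


class Parser:
    """Recursive-descent parser that builds a tree of nested tuples."""

    def __init__(self, tokens):
        self.tokens = tokens + [("EOF", "")]
        self.pos = 0
        self.symbols = {}          # symbol table: variable name -> slot number

    def peek(self):
        return self.tokens[self.pos]

    def take(self, kind, text=None):
        tok = self.tokens[self.pos]
        if tok[0] != kind or (text is not None and tok[1] != text):
            raise SyntaxError(f"expected {text or kind}, got {tok[1]!r}")
        self.pos += 1
        return tok[1]

    def variable(self):
        name = self.take("NAME")
        self.symbols.setdefault(name, len(self.symbols))   # record on first use
        return name

    def operand(self):
        if self.peek()[0] == "NUMBER":
            return ("num", int(self.take("NUMBER")))
        return ("var", self.variable())

    def block(self, *stops):
        """Parse statements until a stop keyword (or the end of the input)."""
        body = []
        while self.peek()[0] != "EOF" and self.peek()[1] not in stops:
            if self.peek()[0] == "NEWLINE":
                self.pos += 1                  # statement separator
            else:
                body.append(self.statement())
        return body

    def statement(self):
        kind, text = self.peek()
        if (kind, text) == ("KEYWORD", "For"):
            self.pos += 1
            self.take("SYMBOL", "(")
            var = self.variable()
            self.take("SYMBOL", ",")
            start = self.operand()
            self.take("SYMBOL", ",")
            stop = self.operand()
            self.take("SYMBOL", ")")
            body = self.block("End")
            self.take("KEYWORD", "End")
            return ("for", var, start, stop, body)
        if (kind, text) == ("KEYWORD", "If"):
            self.pos += 1
            left = self.operand()
            self.take("SYMBOL", "=")
            right = self.operand()
            then_part, else_part = self.block("Else", "End"), []
            if self.peek()[1] == "Else":
                self.pos += 1
                else_part = self.block("End")
            self.take("KEYWORD", "End")
            return ("if", ("=", left, right), then_part, else_part)
        if kind == "STUB":
            self.pos += 1
            return ("stub", text)
        value = self.operand()                 # anything else: value→variable
        self.take("SYMBOL", "→")
        return ("store", value, self.variable())


def parse(source):
    parser = Parser(lex(source))
    return parser.block(), parser.symbols


SOURCE = "For(A,0,2)\nA→C\nIf A=B\n<do foo>\nElse\n<do bar>\nEnd\nEnd"
tree, symbols = parse(SOURCE)
```

After this runs, tree holds a single "for" node whose body contains the store and the two part If, and symbols maps each variable to a slot in the order it was first seen. Input the lexer or parser doesn't recognise raises SyntaxError, which is the "searching for illegal statements" part.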
-Semantic checking: traverse your tree and look for inconsistencies, like a function that expects two parameters being passed three instead, or a comparison like "string literal" == integer, where integer is of type int.
-Code generation: if you're writing a simple, functional language, this part is really easy. Just translate your now-organised program directly into its counterpart in whatever target you've chosen (machine code, some other language, or bytecode for a virtual machine). As you traverse the syntax tree, look up each variable you encounter in your symbol table and insert a reference to it in the generated code: a direct address for machine code or a virtual machine, or a name if you're translating to some other language.
-Code optimisation: scan over your generated code and apply any optimisations you can find. This part isn't strictly necessary, but the code you generate will run slower without it.
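The last three steps can be sketched in the same spirit. The Python below is only an illustration under a few assumptions: expression trees are nested tuples, the symbol table is a hand-filled dict, the target is a made-up stack machine (PUSH, LOAD, ADD, MUL, EQ), and the only optimisation shown is folding constant arithmetic. None of these names come from a real tool.

```python
# Hypothetical symbol table: name -> declared type and storage slot.
SYMBOLS = {"n": {"type": "int", "slot": 0}}
OPCODES = {"add": "ADD", "mul": "MUL", "eq": "EQ"}


def check(node):
    """Semantic check: return the node's type or raise on an inconsistency."""
    kind = node[0]
    if kind == "num":
        return "int"
    if kind == "str":
        return "string"
    if kind == "var":
        if node[1] not in SYMBOLS:
            raise NameError(f"undeclared variable {node[1]!r}")
        return SYMBOLS[node[1]]["type"]
    left, right = check(node[1]), check(node[2])
    # Both sides must agree, and arithmetic only works on ints.
    if left != right or (kind != "eq" and left != "int"):
        raise TypeError(f"cannot apply {kind} to {left} and {right}")
    return "int"


def generate(node, code):
    """Code generation: append stack-machine instructions for node to code."""
    kind = node[0]
    if kind in ("num", "str"):
        code.append(("PUSH", node[1]))
    elif kind == "var":
        code.append(("LOAD", SYMBOLS[node[1]]["slot"]))   # symbol table lookup
    else:
        generate(node[1], code)
        generate(node[2], code)
        code.append((OPCODES[kind],))
    return code


FOLD = {"ADD": lambda a, b: a + b, "MUL": lambda a, b: a * b}


def optimise(code):
    """Optimisation: fold 'PUSH a, PUSH b, ADD/MUL' into a single PUSH."""
    out = []
    for instr in code:
        foldable = (instr[0] in FOLD and len(out) >= 2
                    and out[-1][0] == "PUSH" and out[-2][0] == "PUSH")
        if foldable:
            b = out.pop()[1]
            a = out.pop()[1]
            out.append(("PUSH", FOLD[instr[0]](a, b)))
        else:
            out.append(instr)
    return out


# n * (2 + 3), written directly as a tree of tuples
tree = ("mul", ("var", "n"), ("add", ("num", 2), ("num", 3)))
check(tree)
code = generate(tree, [])
fast = optimise(code)
```

For this tree, generate emits five instructions and optimise shrinks them to three, since 2 + 3 can be computed ahead of time. A tree like ("eq", ("str", "tasty"), ("var", "n")) is rejected by check with a TypeError before any code is generated.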




