Wednesday, June 4, 2008

How to go about creating a new RapidQ compiler?

Let us see the main parts that a compiler should be made of.

First, we need a lexical analyser to tokenise the source code. Next, we need a parser, which analyses the syntax of the source code with the help of the lexical analyser. The parser will also need a code generator to output the compiled program. Optionally, there could be a code optimizer to make the output code more efficient and faster.
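To make these parts concrete, here is a minimal sketch in Python of a lexical analyser, a parser and a code generator working together on simple arithmetic expressions. Everything in it is an assumption made for illustration: the token kinds, the tiny expression grammar and the instruction names (PUSH, LOAD, ADD, SUB, MUL, DIV) are hypothetical and have nothing to do with what RapidQ actually does internally.

```python
import re

# Hypothetical token pattern: numbers, identifiers, or any other single character.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

def tokenise(source):
    """Lexical analyser: turn source text into (kind, text) tokens."""
    tokens = []
    for number, name, op in TOKEN_RE.findall(source):
        if number:
            tokens.append(("NUM", number))
        elif name:
            tokens.append(("NAME", name))
        elif op.strip():              # ignore stray whitespace matches
            tokens.append(("OP", op))
    return tokens

class Parser:
    """Recursive-descent parser that emits stack-machine code as it goes."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.code = []                # output of the code generator

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expression(self):             # expression := term (('+' | '-') term)*
        self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            op = self.advance()[1]
            self.term()
            self.code.append(("ADD",) if op == "+" else ("SUB",))

    def term(self):                   # term := factor (('*' | '/') factor)*
        self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            op = self.advance()[1]
            self.factor()
            self.code.append(("MUL",) if op == "*" else ("DIV",))

    def factor(self):                 # factor := NUM | NAME | '(' expression ')'
        kind, text = self.advance()
        if kind == "NUM":
            self.code.append(("PUSH", int(text)))
        elif kind == "NAME":
            self.code.append(("LOAD", text))
        elif (kind, text) == ("OP", "("):
            self.expression()
            if self.advance() != ("OP", ")"):
                raise SyntaxError("expected ')'")
        else:
            raise SyntaxError(f"unexpected token {text!r}")

def compile_expression(source):
    """Run lexer, parser and code generator over one expression."""
    parser = Parser(tokenise(source))
    parser.expression()
    if parser.peek()[0] != "EOF":
        raise SyntaxError("unexpected trailing input")
    return parser.code

if __name__ == "__main__":
    for instruction in compile_expression("1 + 2 * x"):
        print(instruction)
```

A real compiler for a BASIC-like language would of course need statements, types and error reporting on top of this, but the division of labour between the three parts stays the same.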

You can build a lexical analyser by coding it manually or by generating it with tools like 'Lex' or 'Flex'. Similarly, you can code the parser manually or generate it with software like 'Yacc' or 'Bison'. There are also other such tools (sometimes called compiler-compilers) like 'Gentle', 'Gold Parser' etc. 'Flex' and 'Bison' are open-source and free software. The others are freeware. Many resources such as tutorials are available on the Net for these tools.

OK. How similar should our new compiler be to RapidQ? As far as syntax is concerned, our new compiler should be as compatible with RapidQ as possible, because that syntax is what made programming easy for novices and attracted them to RapidQ. But certain changes could be made to make it more modern while keeping it easy to program.

The main area I would like to talk about is code generation and the runtime environment. As I said in my previous post, the RapidQ compiler compiles the source code into an intermediate language, binds it with an interpreter and creates an executable file containing both. When you double-click the executable file, the interpreter inside it loads the compiled intermediate code, which is embedded in the same file, and runs it.
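To give a feel for what "an interpreter running intermediate code" means, here is a minimal sketch in Python of a stack-based interpreter. The instruction set (PUSH, LOAD, ADD, SUB, MUL, DIV) is invented for this example and is not RapidQ's real intermediate language; the function name run and the way variables are passed in are assumptions too.

```python
import operator

# Hypothetical binary operations of the made-up intermediate language.
BINARY_OPS = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "MUL": operator.mul,
    "DIV": operator.truediv,
}

def run(code, variables=None):
    """Interpret a list of instructions and return the value left on the stack."""
    variables = variables or {}
    stack = []
    for instruction in code:
        op = instruction[0]
        if op == "PUSH":                      # push a constant
            stack.append(instruction[1])
        elif op == "LOAD":                    # push the value of a variable
            stack.append(variables[instruction[1]])
        elif op in BINARY_OPS:                # pop two operands, push the result
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_OPS[op](left, right))
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return stack.pop()

if __name__ == "__main__":
    # Intermediate code for the expression 1 + 2 * x
    program = [("PUSH", 1), ("PUSH", 2), ("LOAD", "x"), ("MUL",), ("ADD",)]
    print(run(program, {"x": 5}))
```

In the scheme described above, a loop like run would be the part shipped inside every executable, and the instruction list would be the compiled program embedded next to it.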

If we want, we could follow the same technique. The disadvantages are:
  • since each executable file produced also contains the interpreter, the size of the executable file is increased by the interpreter's size.
  • since the compiled program is in an intermediate language and is interpreted at run time, it is slower than an equivalent program compiled into machine language.
But the advantages are:
  • the intermediate code produced on compilation could be the same on all platforms (both operating system and microprocessor) and can be interpreted by an interpreter designed for each particular platform.
  • the compiler writer need not know the assembly language or the machine language of the different platforms, since the interpreter can also be written in a high-level language. He need know only the intermediate code.

Another path that we could take is to compile the source code directly into machine language. The advantage is that the compiled code will be very fast and smaller than in the above-mentioned case, since there is no need for an interpreter to be embedded in the executable file. But the disadvantage is that, for code generation, the compiler writer must know the assembly language and architecture of the targeted platform. If you are planning to write the compiler for different platforms, you will need to learn the assembly language/machine language and architecture of each of them, which will be a daunting task.
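As a rough illustration of why this route ties you to a platform, here is a small Python sketch of a code generator that turns made-up stack operations into x86-64 assembly text (Intel syntax). The operation names, the translation table and the function emit_x86_64 are all assumptions for this example, and the output is only a list of instruction strings, not a complete assemblable program.

```python
# Hypothetical translation table for one target only: x86-64, Intel syntax.
# Another processor would need its own table and its own register names.
X86_64_BINARY_OPS = {
    "ADD": "add rax, rbx",
    "SUB": "sub rax, rbx",
    "MUL": "imul rax, rbx",
}

def emit_x86_64(code):
    """Translate made-up stack operations into x86-64 assembly lines."""
    lines = []
    for instruction in code:
        op = instruction[0]
        if op == "PUSH":
            lines.append(f"push {instruction[1]}")
        elif op in X86_64_BINARY_OPS:
            lines.append("pop rbx")               # right operand
            lines.append("pop rax")               # left operand
            lines.append(X86_64_BINARY_OPS[op])   # result ends up in rax
            lines.append("push rax")
        else:
            raise ValueError(f"no x86-64 translation for {op!r}")
    return lines

if __name__ == "__main__":
    # Code for the expression 2 + 3
    for line in emit_x86_64([("PUSH", 2), ("PUSH", 3), ("ADD",)]):
        print(line)
```

Every string in the table is specific to one processor family, which is exactly the knowledge the compiler writer has to acquire again for each new platform.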

Whichever route you choose, writing a new compiler is bound to be an uphill task, but also an interesting and adventurous one, since it is fraught with risks. Risks in the sense that you may get stuck at some point because of the complexity involved or a lack of theoretical knowledge about building compilers.

But is there a better route that could be easier? Perhaps. I have an idea. You can argue that the result cannot be called a compiler in the real sense! I agree. But by taking this route, you may be able to create a quick and dirty compiler to start with and later improve parts of it so that it can finally be called a real compiler. And as a bonus, you can boast that it runs on DotNet!!

Wait till my next post!

See you!