Compilation and Interpretation in C and CPython ----------------------------------------------- It is common knowledge that programs written in high level languages have to be translated into a low level language using programs referred to as `translators`. This low level language is either in a `native` form, in the sense that it is understood by the operating system itself, or in an `intermediate` form which is understood by an intermediate program such as the `bytecode interpreter`. It is also known that C is a compiled language, whereas CPython is both first compiled `and` then interpreted. In this article, I will try to illustrate the difference between the two languages by carrying out some simple experiments (on Linux). Consider a program, `helloworld.c`: .. code-block:: c # include int main(int argc, char **argv) { printf("Hello World!\n"); return 0; } We can compile and execute this program as follows: .. code-block:: c $ gcc -o helloworld helloworld.c $ ./helloworld Hello World The executable file `helloworld` is the low level language equivalent of the high level language program, `helloworld.c`. This is what the operating system on your computer understands and hence when it is executed, it prints `Hello World` on the screen. This process of converting `helloworld.c` to `helloworld` represents the *translation* process. In the case of C, this translation process is performed by the compiler, `gcc` (this translation process is really a process pipeline and involves two other processes `preprocessing` and `linking`, which are carried out by separate programs, automatically invoked by `gcc`). Nevertheless, `compilation` is at the core of the translation process of C programs and is responsible for converting a high level language to its low level equivalent - a version readily executable by the operating system on your computer. It is important to note that this executable file is composed of the instructions you wrote in your C program, along with a other details which are necessary to execute your program. These details are specific to the architecture and operating system you created the executable file on and hence if you copy a executable file you created on a computer with an Intel processor, it will not work at all on a computer which has an ARM processor, for example. Hence, you will have to recompile the program on the new computer before you can execute it. Note that your C program still remains the same, but the low level machine language equivalent is different and `gcc` takes care of this translation. Now, let us consider our first CPython program:: # Print Hello world if __name__=='__main__': print 'Hello World!' You executed this program as follows :: $ python helloworld.py Hello World! Unlike C, where you compiled the program first to get a separate executable file and then executed it (a two step process), here you executed the program in a single step - your program is directly executed on-the-fly. This is how traditionally `interpreters` (interpreted languages) worked. However, modern day interpreted languages like CPython (and others) also involve a compilation step. Your program `helloworld.py` is first converted to an intermediate representation which is a low level equivalent of your high level language program. The difference from C is that the instructions in these low level language equivalents are not meant to be executed by a *real* computer, but a *process virtual machine* [#]_. In the case of CPython, the intermediate representation is known as `bytecodes` and the virtual machine referred to as the `bytecode interpreter` or the CPython virtual machine. Hence, the CPython code is first converted into its bytecode equivalent which is then executed by the bytecode interpreter. When you run a CPython program using `python helloworld.py`, both these steps happen in the background. While discussing the section on C compilation I mentioned that the executable you create on an Intel computer will not run on an ARM computer, because of the architecture specific instructions embedded into the executable required for executing the program. In the case of CPython the bytecodes (result of compilation of the CPython program) are executed by the CPython virtual machine, instead of the real computer. This extra layer of abstraction allows you to execute the same bytecodes (without recompiling) on an Intel computer and an ARM computer, for example. Let us understand this better with a real example. I will use two computers for the experiments: System1 and System2, with both running Fedora Linux. However, System1's instruction set architecture is x86_64 (Intel) where as System2 is a `RaspberryPi `_ with an armv6l (ARM) instruction set. C ~ First, I will the consider the ``helloworld.c`` program. I will compile this program on System1:: $ arch x86_64 $ gcc -o helloworld helloworld.c $ file ./helloworld ./helloworld: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=0xc50d74290927cb25ef9e34055af6c437e89ed5eb, not stripped The ``file`` command shows the type of a file [#]_ and from the above output, the key information for us is that the file ``helloworld`` is a `ELF 64-bit LSB executable, x86-64, version 1 (SYSV)`. You can of course execute the program as we have done earlier using ``./helloworld``. Now, copy the file, ``helloworld`` to System2, and try to execute the object file:: $ arch armv6l $ file helloworld helloworld: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=0xc50d74290927cb25ef9e34055af6c437e89ed5eb, not stripped $ ./helloworld -bash: ./helloworld: cannot execute binary file It is clear from the above error message, that ``helloworld`` could not be executed on System2. Now, transfer the ``helloworld.c`` file to System2 and compile and execute the file as on System1:: $ gcc -o helloworld helloworld.c $ file helloworld helloworld: ELF 32-bit LSB executable, ARM, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=0xba57691af19ff94f894645398e66e263c8f57a9b, not stripped $ ./helloworld Hello World! As you can see, the file format of ``helloworld`` is different on System2 as expected and hence it had to be recreated to execute it. CPython ~~~~~~~ On System1, create the `compiled` version of ``helloworld.py`` using the following code [#]_:: $ python -c "import py_compile;py_compile.compile('helloworld.py')" Or, the cleaner version: ``$ python -m py_compile helloworld.py``. This will create a ``helloworld.pyc`` file in your directory. Once again, we can use the ``file`` command to see the file type of ``helloworld.pyc``:: $ file helloworld.pyc helloworld.pyc: python 2.7 byte-compiled To execute the compiled file, simply invoke the ``python`` interpreter with the ``helloworld.pyc`` file as an argument, rather than the source file: ``python helloworld.pyc``. Now, copy the file ``helloworld.pyc`` to System2 and try to execute it:: $ arch armv6l $ file helloworld.pyc helloworld.pyc: python 2.7 byte-compiled $ python helloworld.pyc Hello World! To summarize, the compiled ``helloworld.pyc`` could be executed without being re-created from its source file, ``helloworld.py`` on two systems with different instruction set architecture. This was made possible by the ``python`` bytecode interpreter on the two systems, which created an abstraction between the bytecodes and the native instruction set architecture [#]_. I should mention here that if your CPython application has anything to do beyond pure CPython code (C extension, for example), the results of the experiments here will not be applicable. ``python`` The CPython executable, ``python`` is nothing but a ELF file (similar to your ``helloworld`` but obviously created from a more complicated set of C source files). The almost magical behavior of CPython bytecodes that we saw in the previous section is made possible by ``python`` taking care of the steps necessary to execute the bytecodes on systems with different instruction set architecture. To understand this better, consider the following two commands, the first on System1 and the second on System2:: $ file /usr/bin/python2.7 /usr/bin/python2.7: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=0x9d8a414b778ff11ec075995248c43cdf5b67f17a, stripped $ file /usr/bin/python2.7 /usr/bin/python2.7: ELF 32-bit LSB executable, ARM, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=0x63fd81d3591769d6be0619b7273935ab9521010c, stripped As is clear from the above output, the file ``/usr/bin/python2.7`` (``/usr/bin/python`` is symlinked to ``/usr/bin/python2``, which is in turn symlinked to ``/usr/bin/python2.7`` in reality), is an ELF executable and it has obviously been compiled separately on both these systems (thus showing the different ELF file formats). Conclusion ~~~~~~~~~~ The above experiments have hopefully shed some light on C being a compiled language and CPython being a compiled and interpreted language - this design leads to its interoperability between different architectures. However, it is important that I mention a language is not compiled or interpreted. That is, it is not technically 100% accurate to say that C is a compiled language. A language implementation, rather than the language is compiled or interpreted. There are interpreters for the C language which interpret your C programs and there are CPython implementations which are compiled (The water gets murkier in the case of CPython, and the boundary between being compiled and interpreted not always clear). Footnotes ~~~~~~~~~ .. [#] Note that, there are two kinds of virtual machines that can be implemented in software: `system virtual machine` and `process virtual machine`. Here, I am referring to the process virtual machine. See the `Wikipedia article `_ on Virtual Machine to learn more. .. [#] http://linux.die.net/man/1/file .. [#] The ``py_compile`` module can be used to compile a CPython program into its bytecode equivalent. This is the version of your program that is executed the CPython bytecode interpreter. See: ``_. .. [#] Actually, to be more accurate, the ``python`` executable takes care of the interfacing with the operating system kernel (Linux Kernel), which is once again different on systems with different instruction set architecture. See also ~~~~~~~~ - `List of readings on Compilers and Interpreters `_ - `Instruction Set Architecture `_