Sep 12, 2022 by Charles Beumier
We discuss the use of the C and Python languages, and of Python calling a C library. Some arguments are general, while others arise from a specific application: modifying .csv (text) files.
Python is generally advised for rapid prototyping and benefits from many libraries. It is a high-level, object-oriented language. In contrast, C is a mid-level language, closer to machine language. Python is therefore better suited to general-purpose applications, while C is a good choice for middleware (typically drivers) or number-crunching programs.
Python execution first transforms the source code (.py) into bytecode (.pyc) if such a file does not exist or is out of date. The bytecode is not machine code, but the Python Virtual Machine that runs it can interpret it more easily and quickly than the source text. Python is therefore said to be interpreted. In contrast, a C program is compiled into an executable file containing machine code. As a second main difference, Python has automatic garbage collection, while a C programmer has to allocate and free memory 'manually'. Generally speaking, C has less overhead and a C program generally runs faster. But Python is more robust. Debugging in C can require much effort, especially for the famous 'segmentation fault' (bad memory access), because the language does not check the addresses used by pointers. Python, on the other hand, detects such mistakes at run time and displays the location and description of the error. The efficiency advantage of C is often available in Python thanks to libraries written in C (e.g. numpy for numerical operations).
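The compilation step described above can be observed from Python itself. The following is a minimal sketch using the standard `compile()` built-in and the `dis` module; the statement `x = a + b` and the file name `<example>` are arbitrary illustrations, and the exact instruction names depend on the Python version.

```python
import dis

# Compile one statement to a code object; its co_code attribute holds the
# raw bytecode that the Python Virtual Machine interprets.
code = compile("x = a + b", "<example>", "exec")

# List the bytecode instructions (names vary between Python versions).
opnames = [ins.opname for ins in dis.get_instructions(code)]
print(opnames)

# Run the compiled code object in a small namespace.
namespace = {"a": 2, "b": 3}
exec(code, namespace)
print(namespace["x"])
```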
We had to write a program to modify integer numbers in many large .csv ('comma-separated values') files, that is, to find a few specific field values (separated by ',' in the file) and replace them with a transformed number. This operation aims at hiding the identity of the people represented by these numbers. A look-up table is used to modify the last 5 digits of the numbers.
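As an illustration of the 5-digit replacement, here is a minimal sketch. The layout of the look-up table is not described in this post, so the sketch assumes a random permutation of the 100,000 possible 5-digit values; `build_lut`, `pseudonymise` and the seed are hypothetical names and values chosen for the example.

```python
import random

def build_lut(seed):
    """Assumed table layout: a permutation of all values 00000..99999."""
    rng = random.Random(seed)
    lut = list(range(100_000))
    rng.shuffle(lut)
    return lut

def pseudonymise(field, lut):
    """Replace the last 5 digits of a numeric field, keeping its length."""
    tail = field[-5:]
    if len(tail) < 5 or not (tail.isascii() and tail.isdigit()):
        return field  # too short or not numeric: leave untouched
    return field[:-5] + f"{lut[int(tail)]:05d}"

lut = build_lut(seed=2022)
print(pseudonymise("123456789", lut))
```

Because the table is a permutation, two different inputs never map to the same output, and the zero-padded formatting keeps every field at its original length.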
We first coded this in Python, taking advantage of the prototyping quality of the language, to evaluate how long it would take to process 1 TB of .csv files. We then searched for Python libraries that handle the csv format and found csvlib and pandas, but they brought only a limited gain in execution time (see Table 1). So we looked for further optimisation in Python and realised that a speedup of more than a factor of 2 was possible with numpy, simply thanks to its optimised search for separators (commas and new lines). Unfortunately, we could not find a way to optimise the 5-digit transform. Because the total runtime was still huge (about 10 hours), we wrote the same code in C and obtained a further speedup of about 9 times, or even about 17 times with the '-O3' optimisation of the gcc compiler, for a total speed improvement of a factor of 44 compared to the direct Python program with no libraries.
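A library-based version can be sketched as follows. This uses the `csv` module of the Python standard library; whether it corresponds to the library called csvlib above is an assumption, and `transform_columns` is a hypothetical helper written for the illustration.

```python
import csv
import io

def transform_columns(text, columns, transform):
    """Apply transform() to the given column indices of every row."""
    src = io.StringIO(text, newline="")
    dst = io.StringIO(newline="")
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        for c in columns:
            if c < len(row):
                row[c] = transform(row[c])
        writer.writerow(row)
    return dst.getvalue()

# Toy transform for the demonstration: reverse the characters of the field.
result = transform_columns("id,a\n12,x\n", [0], lambda s: s[::-1])
print(result)
```

Every row is parsed into a list of strings and written back in full, which is convenient but does more work than overwriting a few fields in place.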
|Method|Time (5 MB)|Time (20 MB)|Hypothetical 1 TB|Speedup|
|---|---|---|---|---|
|Python no Lib|0.40 s|1.77 s|24.6 h|---|
|Python csvlib|0.36 s|1.36 s|18.9 h|1.30 x|
|Python pandas|0.31 s|1.17 s|16.3 h|1.51 x|
|Python numpy|0.17 s|0.70 s|9.7 h|2.53 x|
|C (gcc)|0.021 s|0.075 s|1.04 h|23.6 x|
|C (gcc -O3)|0.011 s|0.040 s|0.56 h|44.2 x|
Table 1. Runtime and speedup for different implementations of pseudonymisation by 5-digit replacement using table lookup.
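The last two columns of Table 1 are consistent with a linear extrapolation of the 20 MB timings. The short sketch below reproduces that arithmetic, assuming decimal units (1 TB = 10^12 bytes, 20 MB = 2 x 10^7 bytes) and taking 'Python no Lib' as the speedup baseline; `extrapolate` is a hypothetical helper name.

```python
# Times measured on the 20 MB file, in seconds (values from Table 1).
TIMES_20MB = {
    "Python no Lib": 1.77,
    "Python csvlib": 1.36,
    "Python pandas": 1.17,
    "Python numpy": 0.70,
    "C (gcc)": 0.075,
    "C (gcc -O3)": 0.040,
}

# Assumption: decimal units, so 1 TB holds 50,000 files of 20 MB.
SCALE = 1e12 / 20e6

def extrapolate(times, baseline="Python no Lib"):
    """Return {method: (hours for 1 TB, speedup versus the baseline)}."""
    base = times[baseline]
    return {name: (t * SCALE / 3600.0, base / t) for name, t in times.items()}

for name, (hours, speedup) in extrapolate(TIMES_20MB).items():
    print(f"{name:15s} {hours:6.2f} h  {speedup:5.1f} x")
```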
Although we have extensive experience in both C and Python, this development reminded us of some well-known facts: the Python version was easily written and debugged, leading to clean and short code, whereas in C we could not avoid some typical mistakes (and segmentation faults!) that cost us more development time.
More precisely, because the files can be very large (up to 25 MB each), we programmed a loop that processes the file in successive chunks. We found an optimal chunk size of about 10 to 100 kB (10^4 to 10^5 characters). We also concluded that the Python readline(), which reads the file line by line (lines end with a '\n' character), is relatively slow. The main optimisation possible in Python was to use numpy (numpy.where(byteL == ord(','))) to look for specific characters in the text array of chars (like '\n' for new line and ',' for separators). This was quite fast, but the replacement of values in the file was slow in Python, so that the overall gain was only about 2.5. As another lesson learned, with pandas we had to pass 'dtype=str' to the read function read_csv(); otherwise attribute numbers were automatically converted to integer or float, which took much processing time. The large speedup obtained with C was due to the fast character search (for ',' and '\n') and the fast update of the string values to be modified in the file. Note that the modified integer values keep the same length, so we could create a copy of the .csv files and overwrite 5-digit strings only where necessary (2 attributes out of 38 in our case).
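The numpy approach can be sketched as follows for one chunk held in memory. This is an illustration under stated assumptions, not the code we benchmarked: the chunk ends with '\n', no field contains a quoted comma, and `locate_fields`, `overwrite_last5` and the `transform` callback are hypothetical names. The vectorised separator search is fast, while the per-field Python loop corresponds to the slow replacement step.

```python
import numpy as np

COMMA, NEWLINE = ord(","), ord("\n")

def locate_fields(buf):
    """Return (starts, ends, cols) of every field ended by ',' or '\\n'."""
    arr = np.frombuffer(buf, dtype=np.uint8)
    seps = np.where((arr == COMMA) | (arr == NEWLINE))[0]
    if len(seps) == 0:
        empty = np.array([], dtype=np.int64)
        return empty, empty, empty
    starts = np.concatenate(([0], seps[:-1] + 1))
    k = np.arange(len(seps))
    # A field opens a row if it is the first one or follows a newline.
    opens_row = np.concatenate(([True], arr[seps[:-1]] == NEWLINE))
    row_first = np.maximum.accumulate(np.where(opens_row, k, 0))
    return starts, seps, k - row_first

def overwrite_last5(buf, target_cols, transform):
    """Overwrite in place the last 5 digits of fields in target_cols."""
    starts, ends, cols = locate_fields(buf)
    wanted = np.isin(cols, target_cols)
    for s, e in zip(starts[wanted], ends[wanted]):
        if e - s >= 5:
            old = bytes(buf[e - 5:e])
            if old.isdigit():
                new = transform(old)
                assert len(new) == 5  # same length: the layout is unchanged
                buf[e - 5:e] = new

chunk = bytearray(b"id,a,b\n1234567,x,7654321\n42,y,99\n")
overwrite_last5(chunk, [0, 2], lambda digits: digits[::-1])  # toy transform
print(chunk.decode())
```

Since every replacement has the same length as the original digits, the separator positions stay valid throughout the loop and the chunk can be written back as is.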
In our specific application, writing the program in C provided a substantial performance gain. But for our development environment, of which csv file manipulation is only a part, Python is preferable to C, in particular for programming the graphical interface. We plan to use Python by default and to call C libraries written for time-critical operations like csv data processing.
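Calling a C function from Python can be sketched with the standard `ctypes` module. To stay self-contained, the example calls `atoi` from the system C library rather than a custom csv library, and it assumes a Unix-like system where that library can be loaded this way; `last5_as_int` is a hypothetical helper name.

```python
import ctypes
import ctypes.util

# find_library may return None in minimal environments; CDLL(None) then
# exposes the symbols already loaded in the current process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C prototype: int atoi(const char *str);
libc.atoi.argtypes = [ctypes.c_char_p]
libc.atoi.restype = ctypes.c_int

def last5_as_int(field: bytes) -> int:
    """Convert the last 5 characters of a numeric field with C's atoi()."""
    return libc.atoi(field[-5:])

print(last5_as_int(b"123456789"))
```

A custom library compiled as a shared object would be loaded in the same way, by passing its path to `ctypes.CDLL` and declaring the argument and return types of each exported function.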
We presented an application of integer pseudonymisation in ASCII files for which the C language offered a substantial speedup over Python. But beyond this conclusion about performance, we want to underline the possibility of calling C libraries from Python, which allows the developer to benefit from the speed of C without sacrificing the readability and ease of use of Python. In this respect, consider the two blog posts 'Calling C from Python' (https://cylab.be/blog/235/calling-c-from-python) and 'Creating a dynamic library in C' (https://cylab.be/blog/234/creating-a-dynamic-library-in-c).