How Much Speed Can You Really Squeeze Out of Python? [Cython/C]
Python is a very popular language due to its beginner-friendly nature, rapid prototyping capabilities, and the support of a large and active community. It’s also known for its stability and wide range of use cases.
However, some developers hold back from using Python in certain scenarios due to concerns about execution speed. It’s true that compared to compiled languages like C/C++, Go, and Rust, Python can be slower. But it’s important to remember that speed isn’t the only factor to consider in every project, and Python’s strength lies in its nature as a scripting language.
You can follow me on LinkedIn: https://www.linkedin.com/in/surajsinghbisht/
Here’s where things get interesting: The Pareto Principle (also known as the 80/20 rule) tells us that for many outcomes, roughly 80% of the effects come from 20% of the causes. This applies to Python optimization as well.
By carefully balancing development speed, security, and ease of use, we can optimize a focused 20–30% of our codebase to achieve significant performance gains (around 80%). This makes Python a compelling choice for many projects.
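Finding that critical 20–30% is a job for a profiler rather than guesswork. As a quick illustration (my own aside, not part of the benchmark later in this article), the standard library's cProfile can rank functions by cumulative time:

import cProfile
import pstats

def slow_join(n):
    # deliberately quadratic: repeated string concatenation
    s = ""
    for i in range(n):
        s += str(i)
    return s

profiler = cProfile.Profile()
profiler.enable()
slow_join(100_000)
profiler.disable()

# rank by cumulative time to surface the hot spots worth optimizing
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)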
In this article, I'll share my approach to maximizing performance. But first, let's explore why Python is perceived as slow and the reasons behind its performance, particularly given that its reference interpreter, CPython, is written in C.
Here are the reasons from my personal perspective:
- When you write code in C, a compiler translates it into machine code before it runs, and the CPU executes that machine code directly. Python code, by contrast, is compiled to bytecode that the interpreter then executes instruction by instruction every time you run it (the dis sketch after this list makes that extra step visible). This indirection makes Python slower than C.
- C allows direct manipulation of memory and data types, which can lead to very efficient code. In Python, types are dynamic and every value is a full object; the bookkeeping that hides this complexity adds overhead and slows execution down.
- Function calls in Python involve more steps, such as dynamic type checking and resolution, whereas C function calls are more direct.
- Python's dynamic nature and built-in safety features, such as bounds checking on sequences, add overhead that costs speed compared to C.
And many more!
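To make that bytecode step concrete, the standard library's dis module prints the instructions CPython actually executes for a function. A minimal illustration (separate from the benchmark below):

import dis

def add(a, b):
    return a + b

# Each printed line is one interpreter instruction: load the two
# arguments, dispatch a generic binary-add, return the result.
# A C compiler would reduce the same function to a few machine
# instructions with no dispatch step at all.
dis.dis(add)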
However, upon closer examination, you’ll find that all these reasons contribute to making Python easy to use, stable, and secure.
Nevertheless, we can address these drawbacks in different scenarios, and I’ll discuss a few ways to do so.
A few days ago, I was working on a task involving log files, each exceeding 1 GB to 1.5 GB in size. My goal was to find the lines containing several specific search strings. Standard GNU/Linux tools like wc, grep, tail, and sed made this task very easy and fast; they completed the searches almost instantly.
However, the same task in Python was slower. While the searches with the Linux tools took about 0.5 to 1.5 seconds, in Python they took 6 to 10 seconds.
Therefore, to achieve the highest possible speed, I experimented with various approaches, and I will share each one with you in turn.
Demonstration
Let’s assume we have a requirement to write a Python module that should offer functions to perform find and count operations on a large text file, similar to grep and wc. For this example, I’m using a simple text log file with the following details:
File name: 2024-02-08.log, size: 835 MB.
It contains 1,556,731 lines, 12,424,653 words, and 874,872,451 characters.
Problem Statement:
Retrieve all lines from a text log file that include specific string patterns.
For this demo, I'm looking for lines in the file that contain both "api.testuser" and "2024-02-08 20".
This task is easily accomplished using Linux tools.
grep -F "api.testuser" 2024-02-08.log | grep -F "2024-02-08 20" | wc -l
We identified 2 lines that matched our conditions.
Now, let’s see how long it took to find these lines.
time grep -F "api.testuser" 2024-02-08.log | grep -F "2024-02-08 20" | wc -l
Hmm, 0.65 seconds. Fast!
Now, let's try to approach this speed in Python.
Try #1:
import time

# SECONDS : 7.17~
# Output : 2 Lines
def getLogLines(filepath, str_patterns):
    output = []
    with open(filepath, 'rb') as fp:
        for line in fp.readlines():
            # keep the line only if every pattern
            # in str_patterns is present in it
            if all(pattern in line for pattern in str_patterns):
                output.append(line)
    return output

start_time = time.time()
result = len(getLogLines(b'logs/2024-02-08.log', [b'api.testuser', b'2024-02-08 20']))
end_time = time.time()

print(f'Time Taken : {end_time - start_time}')
print(f"Result Count : {result}")
The code took approximately 7.17 seconds to execute, and it found 2 lines that matched the given conditions.
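A quick aside on measurement: time.time() reads the wall clock, which can jump if the system clock is adjusted, so for benchmarking the monotonic time.perf_counter() is generally the better choice. A drop-in sketch (do_work is a placeholder for whatever call you are timing):

import time

start_time = time.perf_counter()
result = do_work()  # placeholder for the call being measured
end_time = time.perf_counter()

print(f'Time Taken : {end_time - start_time}')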
Improvements #1:
List comprehension: instead of an explicit for loop that checks each line and appends matches one by one, a list comprehension iterates over the lines and keeps only those satisfying the all() condition. This avoids the repeated lookup and call of output.append() on every iteration.
https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions
Try #2:
import time

# SECONDS : 3.20~
# Output : 2 Lines
def getLogLines(filepath, str_patterns):
    with open(filepath, 'rb') as fp:
        # build the result list in one pass with a comprehension
        output = [line for line in fp.readlines()
                  if all(pattern in line for pattern in str_patterns)]
    return output

start_time = time.time()
result = len(getLogLines(b'logs/2024-02-08.log', [b'api.testuser', b'2024-02-08 20']))
end_time = time.time()

print(f'Time Taken : {end_time - start_time}')
print(f"Result Count : {result}")
The code took approximately 3.20~ Seconds.
Improvements #2:
Reading the file via fp.readlines() is slow because it eagerly reads the entire file and builds a separate Python string object for every line before any comparison happens.
A faster approach is to map the file into memory and scan it there, repeating until the end of the file. This can be achieved with the mmap module from the standard Python library.
Memory mapping means creating a memory-mapped file object with mmap.mmap(): the file's contents are mapped directly into the process's address space, so the operating system pages data in on demand and no explicit read calls or up-front copies are needed.
https://docs.python.org/3/library/mmap.html
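As a minimal illustration of the API (separate from the benchmark below, with a hypothetical path), a memory-mapped file can even be searched as one big buffer without splitting it into lines:

import mmap

with open('logs/example.log', 'rb') as fp:  # hypothetical file
    mm = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
    # search the mapped bytes directly; returns a byte offset, or -1
    offset = mm.find(b'api.testuser')
    print(f'First match at byte {offset}')
    mm.close()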
Try #3:
import mmap
import time

# SECONDS : 1.97~
# Output : 2 Lines
def getLogLines(filepath, str_patterns):
    str_patterns = tuple(str_patterns)
    with open(filepath, 'rb') as fp:
        map_file = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
        # iter() with a b"" sentinel yields lines until EOF
        output = [line for line in iter(map_file.readline, b"")
                  if all(pattern in line for pattern in str_patterns)]
    return output

start_time = time.time()
result = len(getLogLines(b'logs/2024-02-08.log', [b'api.testuser', b'2024-02-08 20']))
end_time = time.time()

print(f'Time Taken : {end_time - start_time}')
print(f"Result Count : {result}")
Time Taken : 1.97~ Seconds.
Impressive! We’ve managed to reduce the execution time from 7 seconds to just 2 seconds. This significant improvement was achieved simply by writing Python code thoughtfully and efficiently.
Now, let's mix in a little flavor of C/Cython, aiming to match the speed of the wc and grep commands.
To further enhance speed, we need to implement the core execution logic in the C language. Why C? Because the CPython Interpreter itself is implemented in C.
To utilize C code within Python, we need the Cython compiler. Cython simplifies the process of creating C extensions for Python, making it as straightforward as writing Python code itself.
To install Cython on your system, refer to the installation instructions on the official website; typically it is just pip install cython.
Create three files named ccorelib.c, ccore.pyx, and test.py:
ccorelib.c : Core execution logic implemented in pure C code.
ccore.pyx : Cython code serving as a bridge between C and Python.
test.py : Python code to use ccorelib.c functions
File : ccorelib.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char **lines;
    unsigned int line_counter;
    int error;
} LogLinesResult;

LogLinesResult fp_get_log_lines(char *filepath, char **qarray, unsigned int q_size,
                                unsigned int line_size, unsigned int max_lines)
{
    LogLinesResult res;
    FILE *in_fp;
    unsigned int match_found;

    res.line_counter = 0;
    res.error = 0;
    res.lines = (char **)malloc(max_lines * sizeof(char *));

    // open file
    in_fp = fopen(filepath, "r");
    if (!in_fp) {
        res.error = -1;
        return res;
    }

    char line[line_size];
    while (fgets(line, sizeof(line), in_fp)) {
        // count how many patterns occur in this line,
        // bailing out at the first miss
        match_found = 0;
        for (unsigned int i = 0; i < q_size; ++i) {
            if (strstr(line, qarray[i])) { match_found++; } else { break; }
        }
        // keep the line only if every pattern matched
        if (match_found == q_size) {
            res.lines[res.line_counter++] = strdup(line);
        }
        // stop once the result buffer is full
        if (!(res.line_counter < max_lines)) { break; }
    }
    fclose(in_fp);
    return res;
}
File : ccore.pyx
from libc.stdlib cimport malloc, free

# compile with: cythonize -i ccore.pyx
cdef extern from "ccorelib.c":
    ctypedef struct LogLinesResult:
        unsigned int line_counter
        char **lines
        int error
    cdef LogLinesResult fp_get_log_lines(char* filepath, char **c_str_array_to_search, int str_list_len, int line_size, int max_lines)

def getLogLines(file_path: bytes, py_str_array_to_search: list[bytes]):
    cdef int str_list_len = len(py_str_array_to_search)
    cdef char **c_str_array_to_search = <char **>malloc(str_list_len * sizeof(char *))
    cdef LogLinesResult res
    output = -1
    if file_path and py_str_array_to_search:
        if c_str_array_to_search is NULL:
            return output
        # borrow pointers to the bytes objects; the list keeps
        # them alive for the duration of this call
        for i, v in enumerate(py_str_array_to_search):
            c_str_array_to_search[i] = v
        res = fp_get_log_lines(file_path, c_str_array_to_search, str_list_len, 8192, 2048 * 10)
        output = []
        for index in range(res.line_counter):
            # appending copies the C string into a Python bytes object
            output.append(res.lines[index])
            free(res.lines[index])  # release the strdup'd buffer
        free(res.lines)
        free(c_str_array_to_search)
    return output
Compile ccore.pyx into an importable Python extension with the cythonize command:
cythonize -i ccore.pyx
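If you prefer a build script over the one-off command, the equivalent setuptools approach (a standard Cython pattern, not something this example strictly needs) looks like this; run it with python setup.py build_ext --inplace:

# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("ccore.pyx"))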
Try #4:
test.py
import time
from ccore import getLogLines

start_time = time.time()
# Time Taken : 0.5917747020721436
# Result Count : 2
result = len(getLogLines(b'logs/2024-02-08.log', [b'api.testuser', b'2024-02-08 20']))
end_time = time.time()

print(f'Time Taken : {end_time - start_time}')
print(f"Result Count : {result}")
Time Taken : 0.59~ Seconds.
Our progress in improving execution speed:
- Attempt #1, plain Python: 7.17~ seconds
- Attempt #2, Python with a list comprehension: 3.20~ seconds
- Attempt #3, Python with mmap: 1.97~ seconds
- Attempt #4, Python with C/Cython: 0.59~ seconds
Conclusion.
Yes! Python may have its moments of slowness, but with clever strategies, we can effectively overcome these limitations.
That’s it for this article.
Don't forget to follow me on LinkedIn, and share your opinion in the comments.