Materials to brush up your Python skills
Asynchronous programming is becoming more and more popular in Python since the introduction in Python 3.5 (with PEP 492) of two keywords: `await` and `async`.
It is important to be aware of a few facts about Python performance:

- the GIL (Global Interpreter Lock) of the CPython implementation puts a strong limitation on Python performance: CPU-bound programs cannot be accelerated with multithreading. Multiprocessing is an option, but it comes with its own set of limitations;
- multithreading remains relevant, however, when a program makes many blocking calls. These include I/O calls (writing a file, sending a request on the web, accessing USB devices): the GIL is released during these blocking calls, which lets the program perform other tasks while the first task waits for a blocking call to complete.
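This use of multithreading for blocking calls can be sketched as follows. This is a minimal sketch: `time.sleep` stands in for a real I/O call (like real I/O, it releases the GIL while waiting):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_call(i):
    # stand-in for an I/O operation: sleep releases the GIL, like real I/O
    time.sleep(0.2)
    return i

t0 = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(blocking_call, range(10)))
elapsed = time.time() - t0
# the ten 0.2s waits overlap: total time stays close to 0.2s, not 2s
```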
Asynchronous programming, often referred to by the name of the Python module `asyncio`, provides an efficient single-threaded way to run programs made of blocking calls.
The tl;dr version of `asyncio` goes as follows:

- blocking calls are annotated with the `await` keyword. The Python interpreter (and its event loop) puts the function on hold until the reply comes, and proceeds with other asynchronous calls;
- functions with an `await` keyword in their implementation must be prefixed with the `async` keyword; an `async` function must in turn be `await`ed. This may sound like a chicken-and-egg problem, and it can indeed be confusing before you get used to it.
So let’s start with a function which does nothing more than sleeping:

async def count():
    print("one")
    await asyncio.sleep(1)
    print("two")
If you run it once, it will take… one second:
>>> import asyncio
>>>
>>> loop = asyncio.get_event_loop() # the loop in charge of sequencing async calls
>>> loop.run_until_complete(count())
one
two
But if you run several calls together, it will still take only one second. Check the printing order: the loop schedules the next call of `count()` when it hits an `await` instruction:
>>> loop.run_until_complete(asyncio.gather(count(), count(), count()))
one
one
one
two
two
two
In a Jupyter notebook, the same call raises RuntimeError: This event loop is already running, because Jupyter runs its own event loop. It is however possible to run a cell with an `await` keyword directly. The following code is valid in Jupyter but not in a plain Python script:

await asyncio.gather(count(), count(), count())
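In a plain script, since Python 3.7 the preferred entry point is `asyncio.run()`, which creates and closes the event loop for you. The loop-based snippet above can be rewritten as:

```python
import asyncio

async def count():
    print("one")
    await asyncio.sleep(1)
    print("two")

async def main():
    # gather schedules the three coroutines concurrently
    await asyncio.gather(count(), count(), count())

asyncio.run(main())  # the three calls still complete in about one second
```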
In practice, many libraries made of blocking calls provide an asynchronous version of their code, which becomes relevant when you need to make many small blocking calls, e.g. many small downloads or many database queries.
`requests` is the most common library for synchronous HTTP requests. For this example, let’s download all the flags of the world from https://flagcdn.com/.
The full list of flags is available at the following link:
import requests

c = requests.get("https://flagcdn.com/fr/codes.json")
c.raise_for_status()
codes = c.json()
# >>> codes
# {'ad': 'Andorre', 'ae': 'Émirats arabes unis', 'af': 'Afghanistan',
#  'ag': 'Antigua-et-Barbuda', 'ai': 'Anguilla', 'al': 'Albanie',
#  'am': 'Arménie', 'ao': 'Angola', 'aq': 'Antarctique', 'ar': 'Argentine', ...}
Now we can time the synchronous download of all flags:
from tqdm import tqdm

for c in tqdm(codes.keys()):
    r = requests.get(f"https://flagcdn.com/256x192/{c}.png")
    r.raise_for_status()
    # ignoring content for this example
100%|█████████████████████████████████████████████████████████████| 306/306 [01:15<00:00, 3.77it/s]
One of the most widespread libraries for asynchronous web requests is `aiohttp`, whose syntax is somewhat similar. The corresponding code reads:
import aiohttp
import time

async def fetch(code, session):
    async with session.get(f"https://flagcdn.com/256x192/{code}.png") as resp:
        return await resp.read()

async def main():
    t0 = time.time()
    async with aiohttp.ClientSession() as session:
        futures = [fetch(code, session) for code in codes]
        for response in await asyncio.gather(*futures):
            data = response
    print(f"done in {time.time() - t0:.5f}s")

asyncio.run(main())
done in 0.52194s
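A caveat with this approach: firing hundreds of requests at once may get you throttled by the server. A common pattern is to cap concurrency with an `asyncio.Semaphore`. Here is a sketch where `asyncio.sleep` stands in for the real `aiohttp` call:

```python
import asyncio

async def fetch(code, sem):
    async with sem:  # at most `limit` downloads run at the same time
        await asyncio.sleep(0.1)  # stand-in for the real network call
        return code

async def main(codes, limit=5):
    sem = asyncio.Semaphore(limit)
    # gather preserves the order of the input coroutines in its result
    return await asyncio.gather(*(fetch(c, sem) for c in codes))

results = asyncio.run(main(["fr", "de", "it"]))
```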
If you need to go through a proxy, both libraries support it:

# with requests
requests.get(url, proxies={"http": proxy, "https": proxy})

# with aiohttp
async with session.get(url, proxy=proxy) as resp:
    ...
We will implement a particular case of web crawling in this example, with a breadth-first exploration of a graph.
Pick the final identifier in the URL (here `Q9439`) and replace it in the JSON URL:

| | URL |
|---|---|
| Wikidata item | https://www.wikidata.org/wiki/Special:EntityPage/Q9439 |
| JSON file | https://www.wikidata.org/wiki/Special:EntityData/Q9439.json |
Explore the JSON and find specific relationships in the `claims` dictionary: `P22` for the “father” relationship, `P25` for the “mother” relationship and `P40` for the “children” relationship. Find new identifiers for members of the extended family in those dictionaries.
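Extracting identifiers from a claims entry can be sketched as follows. The nested dictionary below is a hand-written fragment mimicking the usual Wikidata layout, and the `Q…` identifiers are placeholders, not real entities:

```python
entity = {
    "claims": {
        "P40": [  # "children" relationship
            {"mainsnak": {"datavalue": {"value": {"id": "Q1"}}}},
            {"mainsnak": {"datavalue": {"value": {"id": "Q2"}}}},
        ]
    }
}

def related_ids(entity, prop):
    """Collect identifiers attached to one relationship (P22, P25 or P40)."""
    return [
        claim["mainsnak"]["datavalue"]["value"]["id"]
        for claim in entity["claims"].get(prop, [])
    ]

related_ids(entity, "P40")  # → ["Q1", "Q2"]
related_ids(entity, "P25")  # → [] (no "mother" claim in this fragment)
```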
Explore the entries for all neighbours of the current entry. Pay attention to stick to a breadth-first exploration: explore all relatives directly related to Queen Victoria, then all relatives with two degrees of relationship, etc.
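The breadth-first exploration itself can be sketched independently of the download logic. Here a toy adjacency dictionary stands in for the Wikidata lookups, and the identifiers are placeholders:

```python
from collections import deque

def bfs(start, neighbours, max_depth=2):
    """Visit nodes level by level, up to max_depth degrees of relationship."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        node, depth = queue.popleft()
        order.append(node)
        if depth < max_depth:
            for nxt in neighbours.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return order

family = {"Q0": ["Q1", "Q2"], "Q1": ["Q3"], "Q3": ["Q4"]}
bfs("Q0", family)  # → ["Q0", "Q1", "Q2", "Q3"]
```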
Draw the genealogical subtree (consider the `networkx` package) with Queen Victoria, Queen Elizabeth II and the Duke of Edinburgh.
Extend the graph with another “grandfather of Europe”, Christian IX of Denmark. The late British royal couple were also cousins through this branch. Look at their relationships with other cousins, such as Nicholas II of Russia (the last tsar of Russia) or Felipe VI, the current King of Spain.
You will find a suggested solution in the `asyncio.ipynb` notebook.