A propos/About RSS Tags

Notes from PyConFR'2025

Martin Kirchgessner, 2025-11-03

I'm writing this coming back from the french branch of PyCon, that I attended for the first time. I was happy to attend as a speaker, I could give a short talk about my biggest 2025 work, swh-fuse.

This article lists talks I saw. I'm noting things I learned, so don't expect much literature nor a complete transcript but hopefully you'll learn something too!

How's made CPython

Program entry

Presents the CPython development process: PRs (CPython is on GitHub since 2022), issues, projects, PEPs.

Who's behind:

5 people in steering council (elected by core developers)
4 release managers - engaged for 6 years, notably decide which feature gets added during betas. We're already at 3.15a1 btw.
100 core developers (~50 active), review and merge.
- the commiter is responsible for maintainance, not the author!
and, bug triagers and occasional contibutors

There are also working groups (structured by PEPs) on documentation, packaging, typing, C API, security and translation.

Python language summit might happen during EuroPython (instead of PyCon US) in 2026. Maybe.

Ansible at scale (with a twist)

Program entry

More particularly about AWX (ansible's web interface, uses Django). Which is relevant when

there are "many" inventories/teams/projects (they launch >600k jobs per month over >7k hosts), it provides RBAC and auditing (can log to logstash)
scheduling is needed
staff is not command-line-oriented

AWX relies on execution environments

like venvs, for Ansible
implemented as an Ansible controller in a container

They inserted a custom (FastAPI) app that generates inventories on the fly, pulled from another inventory app.

For remote access they create SSH certificates with a self-hosted authority, so every new machine only has the authority's public key and users create short-term certificates when needed. This makes rollovers much easier. AWX connects to Hashicorp's Vault to generate those certificates to authenticated users.

⚠️ Red Hat is not publishing (free) new images of AWX. IBM continues to push it open source, so you can build it yourself but it's going micro-services and they don't publish build scripts. Or, just buy Ansible Automation Platform 🤡

Le duo comique accélère une suite de tests Django/Pytest

speeding up a Django/Pytest test suite / Program entry (full room !)

pytest can give the top-10 slowest tests with --durations=10.

tips:

don't hash password like in production (configure it from a fixture so prod stays encrypted), especially when creating many users during tests
don't always enable coverage measurement
django's pytest has --reuse-db that can be nice when repeating tests locally

parallelism with pytest -n is nice but don't use auto, speedup is really low, usually.

Advices for GitlabCI + Docker runners

parallelism on all cores causes slowdowns
deploy a postgres instance for tests, so gitlab-runner is not pulling/running PG (docker pull is their slowest step)
PG database templates can skip the migrations step

Travailler avec des Data Lakehouses en Python, sans Spark

Working with data lakehouses in python without spark / Program entry

Lakehouse = mixing data warehouse and data lake... that is, separate storage and compute, on scalable and governable platforms. Also allows transactions, snapshotting, branching, schema versioning.

Often rely on open table formats: delta ("Delta Universal Format aka UniForm" from databricks), iceberg (netflix), hudi (netflix), ducklake (duckDB).

DeltaLake = a folder of [Parquet files + transation logs in JSON files].

In Python, on small-medium scale, we can avoid Spark entirely with packages

deltalake (Rust w/ python binding) - this talk mostly covers deltalake.
pyiceberg (pure python)
Polars
DuckDB, as of Oct 2025, can read delta tables.

deltalake can connect directly to an S3-compliant storage. Schema and data can be exported to Arrow.

write_deltalake takes a dataframe but also has the mode option: can append, overwrite, upsert, etc. and it's actually creating a new version of the dataset and only stores the diff. Tables also have a merge operation that is inspired from Spark, ie. it matches with conditions per row. Their superstar method is load_as_version(i), to time travel in the dataset.

Speaker is from Grenoble and now works on laketower, a CLI+Web app to manage lakehouse table.

Oh there's a Kafka to Delta ingester.

Visualiser les réseaux sociaux à échelle planétaire, un milliard de pixels à la fois

Visualizing social networks worldwide, one million pixel at a time / Program entry

Speaker's company sells social network analysis, public studies include finding (political) spam networks. Makes demos with Gephi, an open-source project for visual network analysis. Interesting (but not sure it scales).

They work on implementing ForceAtlas visualization algo (implements gravity between nodes), but takes 24h for 10m nodes, they're working on scale above 100m using GPUs. Also they export those graphs ... in SVG.

If you need big data, write it in XML

htmx et Django : retour d’expérience 3 ans plus tard

3 years of Django and html / Program entry

Note to self: watch the replay (because htmx ♥️), but it was in parallel of something more surprising:

How to solve a Python mystery

Program entry

This talk did not contain much Python, it instead gave many hints to observe a Python process as a black blox.

There are so many linux observability tools !

/proc can leak useful information about a process:

readlink /proc/{pid}/cwd
cat /proc/{pid}/environ | tr '\0' '\n'
ls -l /proc/{pid}/fd gives all file descriptors as symbolic links

🏴‍☠️ this command also annotates file(s) that have been deleted, but if they're still open by the process you can still tail or cat from /proc/{pid}/fd/ID... which is great when accidentally deleting the log file of a running app 🏴‍☠️

when you only have user-level access you can still install strace, by downloading strace and libunwind, and run LD_LIBRARY_PATH=$(pwd)/libunwind strace

favorite flags:

strace -f -ttt -o output.txt -s 1024 -p {pid}

-f is --follow-forks
-s for larger screens - important to see long strings. Like, DB queries.
-ttt for precise time

Then you might like the searchable linux syscall table.

I should iostat -d 3 more often.

HOWTO quick'n'dirty benchmark write thoughput:

dd if=/dev/zero of=~/block.tmp bs=8k count=20480 oflag=direct

oflag=direct tells the kernel to not cache. 8K is a postgres block, try 4k to match the usual FS block size. When it's a NAS, always ask about the local block size, watch out for the difference when matching that size!

network hints:

Use tcptraceroute when the network does not allow UDP nor ICMP.
When there's a ~40ms delay (or 25 timeouts per second), "it's always TCP_NODELAY" (because of Nagle's algorithm vs delayed ack). Speaker's PR to librdkafka about that is still open
in such cases it's also interesting to ask "what would curl do ?" - that is, compare your trace to strace curl
be careful that /proc/sys/net/ipv4/tcp_keepalive* defaults are bad

advises to monkey-patch with libkeepalive:

export LD_PRELOAD=/path/to/libkeepalive; ./my_timeouted_app

Virtual Environments and Lockfiles — How Python Is Improving Reproducibility

Program entry / slides and code

a history of how we went from requirements.in to lock files. Promoting private repos (to filter authorized packages) and uv.

Computer vision data version control & reproducibility at scale

Program entry (presented by someone else)

Represents LakeFS, sold as git for data. Basically manages subsets in an object store bucket, and provide the API to present the data at a given point in time. With web interfaces. Speaker couldn't develop how LakeFS differs from DeltaLake.

Who does Python trust, and why ?

Program entry

Pretty pedagogic reminder of public key infrastructure. Python does not uses system's trust store, but usually relies on the certifi package. When using private authorities, requests can watch for the REQUESTS_CA_BUNDLE environment variable.

More recently (py 3.10) truststore is recommended as a certifi replacement. Its selling point is

import truststore
truststore.inject_into_ssl()

and let requests go. Used by pip by default since 24.2

swh-fuse, or how to put a whole code archive behind a folder

oh i'm in the program!

slides are here

Refactoring at Scale: Making Analytics Type-Safe with Codemods and AI

Program entry

A Sentry developer tells the story of a migration from loosely-typed objects and weird classes families using too many kwargs, to mypy-friendly dataclasses. "at scale" because that concerned 200+ Event classes having 300+ record() implementations.

As theres' static analysis they tried Cursor. Didn't work well, even with attempts at prompt engineering. Instead they used codemods, that works directly on ASTs, with libCST because it can preserve comments and docstrings. In that case Cursor can vibe-code transformers, to avoid starting from scratch with that new tool. That worked well.

Second problem was that refactoring in a single PR would touch 200 files that should have been validated by >30 different teams. Instead they tried to split by team or topic, so each MRs would touch ~5 files. So they had to know who owns what! Again they vide-coded, this time a git history miner.

The refactoring is done but it also highlighted pre-existing problems (incorrect data pieces in the wrong object, unexpected kwargs).

Advises that AI can assist tooling, not execution. Instead reliable codemods clearly outperformed LLMs.

Python Native Interface for faster and stronger Python

Initial title was Universal Python Extensions: Performance, Compatibility, Sustainability, and Less CO₂ (as in the program). But Python Native Interface emerged during EuroPython2025 (last summer) so this talk got a new title.

Py3.14 now embeds a just-in-time compiler, so speaker benchmarked a simple function to see how things are going compared to CPython 3.11 and it's not that great:

	without JIT	with JIT
CPy 3.9	0.74
CPy 3.11	1.00
Cpy 3.14	1.28	1.07
PyPy		16.2
GraalPy		17.9

(yes CPy3.14 is slower with the JIT) To get faster, pythonistas either use native extensions or perf-oriented implementations of Python. But you can't mix those two approaches! And just-in-time compilers can only work on the Python code, of course.

PyPy is still is on par with Python 3.11, Microsoft cut fundings to the "faster cpython" project, so both are losing traction.

The language itself make things harder. Some choices in the C API make things harder too, there's too many boxing/unboxing, etc. So HPy started a few years ago, by PyPy and GraalPy devs. It proposes a new C API for Python, modernized. It's GC-friendly, particularly for moving garbage collection. HPy has two target ABIs: the current one, and a new, universal one (cross-python-versions). For example it proposes a fully opaque PyObject that allow extensions to be compatible with both modes (this will prevent native extension from playing with PyObject fields directly - many do that, but it's harmful to GC). But HPy is stalled: lacking CPython core devs, funding, big projects.

and now there's PyNI, that took the old idea mixed wiyh HPy ideas and it's a PEP draft so it might gain traction.

Speakers points out that those projects are currently depending too much on private investments, whereas a lab or university usually don't contribute to Python although they use it and could benefit from making it faster (and are already spending X00K€s on Matlab...).

Démarrage Python : mesurer avant d’optimiser

Python startup: measure before optimizing / Program entry

Context: datadog wants to speed up Python startup to improve their AWS Lamba use.

So python standard library has cProfile and pstats. Speaker recommends snakeviz to visualize stats. But cProfile is not that great: it slows down the app, can't see native code, hardly helps when multi threaded/process.

cProfile actually relies on sys.monitoring.register_callback.

Recommends pyspy instead: can hook itself onto the app, has --native --subprocesses - however you'll have to use perf to actually measure in Rust stacks (then, merge stats manually). Pyspy can export to a standard format that is understood by the speedscope visualizer

I'm tired but that was nice

The conference proposed in parallel 4 talks and 2 or 3 workshops, so this is only a subset! Have a look at the complete program by yourself. Talks were recorded, keep an eye on AfPy's news to watch complete talks... soon.

I was surprised that talks in English were in a single track, instead of having tracks grouped by topic. But did manage to ride all those stairs to attend overall very interesting talks. I'm super thankful to organizers: the venue was comfortable, we had a great saturday night too, learned plenty things, met interesting people. Great!

Time to get back to work, though.

How's made CPython

Ansible at scale (with a twist)

Le duo comique accélère une suite de tests Django/Pytest

Travailler avec des Data Lakehouses en Python, sans Spark

Visualiser les réseaux sociaux à échelle planétaire, un milliard de pixels à la fois

htmx et Django : retour d’expérience 3 ans plus tard

How to solve a Python mystery

Virtual Environments and Lockfiles — How Python Is Improving Reproducibility

Computer vision data version control & reproducibility at scale

Who does Python trust, and why ?

swh-fuse, or how to put a whole code archive behind a folder

Refactoring at Scale: Making Analytics Type-Safe with Codemods and AI

Python Native Interface for faster and stronger Python

Démarrage Python : mesurer avant d’optimiser

I'm tired but that was nice

htmx et Django : retour d’expérience 3 ans plus tard

Démarrage Python : mesurer avant d’optimiser