Restructuring codebases for humans & AI
In January I went all in on AI-driven development. February brought the follow-ups: applying the approach across different tech stacks, experimenting, and learning about the consequences of treating code as less important.
Secrets Management & Infrastructure Security
I never liked .env files: they’re a poor way to inject secrets into containers (anything passed as an env var can be read with docker inspect), they drift out of sync with the VCSed .env.example, and sooner or later someone shares them with colleagues via Slack.
So I put in some time (finally) to look for a better way to manage my secrets. I deployed Infisical as a self-hosted secrets manager. I got a nice UI, secrets grouped per environment, and an easy way to render secrets into containers as files instead of env vars.
I still have a small .env file, but it’s encrypted with SOPS + age, so I can safely commit it to a git repo. The only secret necessary to run the app is the age key that decrypts that file, which I store in my Bitwarden. Finally, Infisical Agent renders and mounts template files into containers, so I no longer pass secrets as env vars.
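For the curious, the wiring is small. Here’s a sketch of what the .sops.yaml could look like — the age public key is a placeholder and the file names are illustrative, not my actual setup:

```yaml
# .sops.yaml -- tells sops which key encrypts which files.
# Encrypt: sops -e .env > .env.enc    Decrypt: sops -d .env.enc > .env
creation_rules:
  - path_regex: \.env\.enc$
    age: age1exampleplaceholderpublickey
```

The age private key that decrypts is the only thing that lives outside the repo.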
Supabase Platform & Database Consolidation
With AI agents generating 99.9% of the code I write, n8n completely lost its value to me. The whole point of drag-and-drop was to avoid writing code. Now that code gets generated anyway, I see value in having proper code which I can test, review, and monitor.
I chose Supabase as my personal n8n replacement. I’ve been using Postgres as the database powering my workflows, and I like the general feel of the Supabase stack, so the choice was pretty easy.
I didn’t like the amount of boilerplate necessary to get things started, but it turned out to make sense. Now I’m turning my n8n workflows into Supabase Edge Functions.
I’m self-hosting Supabase at Hetzner. I’d already been using pgroll to manage my DB schema, so I recreated the DB from scratch (the supabase postgres image comes with quite a lot of customizations over regular PG), applied the schema migrations, and dumped only the data from my old database.
I’m still using pgroll to maintain Supabase DB schemas, and will keep it in the stack.
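For the unfamiliar: a pgroll migration is a JSON file of declarative operations that pgroll applies with zero-downtime expand/contract semantics. A minimal sketch, with made-up table and column names:

```json
{
  "name": "02_add_status_to_jobs",
  "operations": [
    {
      "add_column": {
        "table": "jobs",
        "column": {
          "name": "status",
          "type": "text",
          "nullable": false,
          "default": "'pending'"
        }
      }
    }
  ]
}
```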
Quantia Agents — From n8n to Python
At Quantia we’ve used n8n as a prototyping platform for a bunch of internal AI Agents. It worked fantastically, yet we knew from the beginning that for production we’d move towards a code-based solution. So we moved to LangGraph.
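The mental model behind the move: LangGraph structures an agent as a state graph, where nodes are functions over a shared state and edges route between them. Here’s a stdlib-only sketch of that shape — not LangGraph’s actual API and not our agent; node names and routing are invented for illustration:

```python
# Minimal state-graph pattern: nodes transform a shared state dict,
# edges decide which node runs next (None means terminal).

def classify(state):
    # Toy classifier: route questions vs. statements.
    state["intent"] = "question" if state["text"].endswith("?") else "statement"
    return state

def answer(state):
    state["reply"] = f"Answering: {state['text']}"
    return state

def acknowledge(state):
    state["reply"] = "Noted."
    return state

NODES = {"classify": classify, "answer": answer, "acknowledge": acknowledge}
EDGES = {
    "classify": lambda s: "answer" if s["intent"] == "question" else "acknowledge",
    "answer": lambda s: None,       # terminal node
    "acknowledge": lambda s: None,  # terminal node
}

def run(state, entry="classify"):
    node = entry
    while node is not None:
        state = NODES[node](state)
        node = EDGES[node](state)
    return state

print(run({"text": "What broke?"})["reply"])  # → Answering: What broke?
```

The real framework layers checkpointing, streaming, and tool calls on top, but the routing idea is the same — and unlike a drag-and-drop canvas, every piece of it is testable code.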
The migration itself was smooth. Every n8n workflow had been developed with openspec first, then implemented via MCP. Moving to LangGraph was just replaying the same specifications against a different framework. Good spec coverage made this almost boring.
The interesting parts came after code generation. For the first agent on LangGraph I set myself a challenge: build and maintain it without reading the codebase (too much). Design at the level of big blocks, never go deeper than class names, let AI handle the internals. The early results aren’t amazing.
The first thing I deployed wasn’t a feature. It was LangFuse, so we could observe how people interact with the agent, collect traces, and act based on data rather than the loudest complaints.
The main learning here is that striving for top-level quality while treating the code as an unimportant detail is fucking hard, and it drives way more changes than I anticipated.
Rebuilding dbt for AI Agents
The LangGraph migration highlighted a broader problem: how do we structure codebases so AI agents can work with them efficiently? The shift isn’t just adding more context to CLAUDE.md. It may require serious changes to code structure, so that the agent can find all the related pieces and reason about them.
What I struggled with most was git repository boundaries. AI can work across repositories, but it’s far from perfect at it. Pre-commit hooks, claude workspaces, and the minor inconsistencies that creep in when different people work on different parts of a codebase all come to the surface.
At Quantia I work mostly with data pipelines and infrastructure. So far we had a fairly standard project setup:
- dbt – repository with all the SQL models, responsible for data transformations,
- gitops – repository with the kubernetes cluster config & app deployments,
- terraform – repository with the infrastructure foundations.
This split came mostly from Conway’s Law: I worked on dbt, my colleague worked on kubernetes, and we both touched Terraform.
Then I worked on a bigger change to the data pipeline that required touching all three repos: SQL model changes, Argo Workflows changes, and some permission updates at the bare GCP level.
So, instead of accepting that I’d need to create three PRs in three repos, I flipped the entire thing. I extracted the related Argo Workflows config from gitops into the dbt repo, and moved the GCP permissions config from terraform into dbt. Now the AI can reason much better about the necessary changes.
To take it further, I wrapped all the Quantia repos in a meta repo: a single repository bundling all the repos as submodules, with a standardized structure for every developer. I also started building docs for both humans and AI about the various dependencies.
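The resulting layout looks roughly like this (directory names are illustrative, not the actual repo tree):

```
quantia-meta/
├── .gitmodules   # every repo registered as a submodule
├── docs/         # shared docs for humans and AI agents
├── dbt/          # SQL models + related Argo Workflows + pipeline GCP permissions
├── gitops/       # kubernetes cluster config & app deployments
└── terraform/    # infrastructure foundations
```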
Doing this showed me three things that an AI-driven development environment requires:
- Insanely more testing — automated, broad, running often,
- Self-healing production that tolerates AI mistakes,
- Easy rollbacks for when it doesn’t heal.
Predictions Pipeline
I took over my first ever data science project. It came to me as a set of Jupyter notebooks. I honestly hate that tech. I know it brings a lot of prototyping and experimentation benefits, but the amount of bad practice it enables is insane: version control is a pain, CI setup is tricky, and extracting experiments into reusable code modules is hard.
I’ve never personally used it at a high level, but I’ve worked with enough smart people who delivered poor quality with it to start thinking there’s something off with the tool, not the users.
When working on this project I made a bet with myself: do it without even running the notebooks. Let AI port the codebase to Python modules, and interact only with that refactored version.
How do I know I’m keeping the functionality while working with a tech stack I’m completely unfamiliar with? The handicaps were real:
- I never worked on any predictions data science,
- I never used Optuna for hyperparameter tuning nor LightGBM ,
- I had no idea how the notebooks worked internally,
- The definition of “works” vs “doesn’t work” in data science is really hard to pin down.
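One way to answer that question is characterization (“golden”) testing: record the notebooks’ outputs once, then assert the ported modules reproduce them. A minimal sketch of the idea — predict_batch and the golden fixture are hypothetical stand-ins, not the project’s real interface:

```python
import math

# Hypothetical stand-in for a function ported out of the notebooks.
def predict_batch(rows):
    return [round(r["x"] * 0.5 + 1.0, 6) for r in rows]

# In a real setup this fixture would be generated once by running the
# original notebook and saving its inputs/outputs to disk.
golden = {"inputs": [{"x": 2.0}, {"x": 4.0}], "outputs": [2.0, 3.0]}

preds = predict_batch(golden["inputs"])
assert all(math.isclose(p, g, rel_tol=1e-6)
           for p, g in zip(preds, golden["outputs"]))
print("port matches notebook outputs")
```

The point is that you never need to understand the internals — only that the refactored code produces the same numbers as the original did.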
This didn’t bring many new learnings, but reinforced the already mentioned ones:
- I merged the Python modules into the dbt repository so AI could easily discover the database structure and dependencies,
- I found a bunch of low-hanging fruit for automation: data ingestion, data cleanup,
- I looked for small building blocks that could be extracted and unit tested,
- Performance optimizations paid off: by optimizing memory usage I shortened the feedback loop from a few hours (running the entire pipeline) to about an hour.
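To give a feel for the kind of memory optimization involved (this illustrates the technique, not the pipeline’s actual code): downcasting numeric dtypes alone often halves a DataFrame’s footprint, which keeps intermediate results in RAM and the feedback loop short.

```python
import numpy as np
import pandas as pd

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that fits the data."""
    out = df.copy()
    for col in out.select_dtypes(include="integer"):
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include="float"):
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Made-up data: values fit in int16 and float32, so memory drops sharply.
df = pd.DataFrame({"clicks": np.arange(1_000, dtype="int64"),
                   "score": np.random.rand(1_000)})
before = df.memory_usage(deep=True).sum()
after = shrink(df).memory_usage(deep=True).sum()
print(f"{before} -> {after} bytes")
```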
This is still in progress, but I learned (again) that an insane level of test automation, and giving the AI Agent a way to validate its own work, is super important.
Kubernetes Scaling & FlareSolverr Saga
FlareSolverr is a proxy for bypassing Cloudflare protection, running headless Chromium under the hood. We’d been using it as sidecar containers on a small scale. With traffic growth we spun up a dedicated cluster of FlareSolverr pods as an internal service.
After the release, pods started OOM-crashing. So I kept adding memory: 3Gi → 4Gi → 6Gi → 7Gi. No luck. I tried different traffic distributions: more nodes with fewer pods, fewer nodes with more pods, different concurrency per process. Still crashing.
Then I noticed something: despite all the crashes, the service was fine. New pods replaced the old ones, jobs got processed, nothing was actually broken. Chrome leaks memory — that’s just what it does.
The simplest fix turned out to be no fix. I accepted the leak, brought limits back down, settled on the least-failing combination of pods and concurrency, and adopted KEDA autoscaling.
When there’s no crawl demand, the entire node pool scales to zero — nodes don’t exist, can’t crash, cost nothing. As soon as jobs hit the Redis queue, KEDA spins up the first node. Additional nodes scale based on resource utilization of existing ones.
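The relevant piece is a KEDA ScaledObject with a Redis list trigger. The names, queue, and thresholds below are placeholders, not our production values:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: flaresolverr
spec:
  scaleTargetRef:
    name: flaresolverr          # the Deployment to scale
  minReplicaCount: 0            # scale to zero when the queue is empty
  maxReplicaCount: 6
  triggers:
    - type: redis
      metadata:
        address: redis.default.svc:6379
        listName: crawl-jobs    # queue the crawl jobs land in
        listLength: "5"         # target queue length per replica
```

Once KEDA scales the pods to zero, the cluster autoscaler removes the now-empty nodes, which is what makes the pool cost nothing at idle.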
Poland Technology Grant
My group of mentees got filtered out by reality: out of 6 people, I’m actively working with 4. The rest receive my emails and either keep rescheduling appointments or plainly ignore them. I tried to motivate them at the beginning, but I’ve since accepted it, and learned that I prefer having more sessions with those who see the value in them.
With those four I’m getting quite the experience of stepping out of my bubble. They’re third- and fourth-year engineering students, yet the gap between my approach to work and theirs is insane. I can’t put into words my surprise when I saw that all of them copy-paste code from Gemini into their editors.
Like, really: I showed them gemini-cli, NotebookLM, and claude code, and it was all totally new to them. But the best part came just after. One of them came to the next class all excited:
- I gotta show you something! It’s fucking cool! I did this thing with Gemini, check this out!
He then demoed me an app to generate Minecraft models. You select in the sidebar what you want to build (a bridge, a tower, a house, etc.), configure some parameters (width, height, complexity, etc.), and you get a full 3D model rendered, with the ability to rotate, view, and inspect it. It also generates instructions on how to build the thing in Minecraft.
I was mind-blown. Not because of the tech choices, but because I had no idea such a niche existed, or that you can get such performant 3D modelling with textures, lights, and all those features in the browser. And also because he had ABSOLUTELY no idea how this thing worked!
- What libraries did you use?
- I don’t know.
- Why do you have python here to run the web server, while the entire project is JS?
- No idea.
A few months ago my inner coder would have disagreed with this. But now I’m thinking: how can we embrace such workflows? What questions should I ask him? How many layers of abstraction down should understanding go?