
Data fuckups, data learnings
When I started my coding adventures in 2011, I got hooked on pretty much any tech that crossed my path. Backend with Rails, frontend with CSS & React, infrastructure automation, Ansible, Docker, you name it.
I was all over it. Except for all things data related. I saw them as boring. Unworthy of my attention.
Then I joined Quantia.ai – a company for which the data is the core of the business. I quickly understood how wrong I was. Data engineering is not only interesting, but freaking challenging.
This experience made me revisit some of the fuckups I’ve been a part of. Here, I’m sharing some lessons learnt from them.
These are real situations from projects I’ve been a part of over the last ~10 years. I skipped company & project names for privacy.
#1. Data Quality⌗
Let’s start with a B2B project. We delivered reminders about on-site visits to our customers’ clients via text messages. They synced their contact database, and we handled the rest. The goal of these messages was to decrease the number of no-shows for scheduled appointments.
At some point, we received feedback about that particular feature. “It doesn’t work,” it said. After some digging, we found out what “doesn’t work” really meant: the messages were delivered, but people still missed their appointments.
We looked at the monitoring system, checked the logs, but everything seemed to work:
- (some) messages were delivered;
- (some) clients came for their appointments;
- (some) clients replied to cancel or reschedule.
With this knowledge, we started fixing right away: tweaking text content, sender names, delivery times, and message quantity. The number of no-shows didn’t change at all.
For lack of better ideas, one of the team members started looking at the customers’ data in search of inspiration. He ran a simple query, something like:
select phone_number from customers;
Then he just looked at the results, finding that:
- plenty of phone numbers were incorrect;
- there were numbers like +1 123 123 123 or 00 000 000 000;
- numbers were duplicated between customers;
- we even had numbers like SECRET, no, or simply NULL values.
Had we only run that query a week earlier… But, better late than never, yeah? Having that knowledge, instead of tweaking messages delivered to 123123123 we:
- set up proper validation wherever we could;
- worked with our customers to improve their registration forms;
- started sending emails on top of text messages.
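The validation step can be sketched in a few lines. This is a minimal illustration, not the project's actual code; the regex and the junk values it catches are examples.

```python
import re

# Rough E.164-style check: optional "+", non-zero first digit, 8-15 digits total.
# Illustrative only -- real-world phone validation has many more edge cases.
E164_RE = re.compile(r"^\+?[1-9]\d{7,14}$")

def normalize_phone(raw):
    """Return a normalized phone number, or None if the value is junk."""
    if raw is None:
        return None
    cleaned = re.sub(r"[\s\-().]", "", raw.strip())
    if not E164_RE.match(cleaned):
        return None  # rejects "SECRET", "no", "00 000 000 000", and similar
    return cleaned
```

Running every synced contact through a check like this, before sending anything, would have surfaced the garbage numbers on day one.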
This way, we decreased the number of no-shows, without a single change to the messages’ content. All we had to do was to check the data before jumping into solutions.
Learning: Never trust external data.
#2. Web scraping⌗
I once imagined that scraping the internet was a straightforward task. Fire up Selenium, write a script, and voilà – done.
This may be true for a hobby project, like scraping a bunch of pages from a local machine. But trying to scale such an approach quickly becomes a disaster.
One of the first web scraping projects I did was a tool to collect product offerings from a bunch of online shops. I approached it with a CSS-selector-based crawler. It very quickly became a nightmare to maintain, dealing with randomized selectors like: extract_data(".235s34__product > h1.fh33c > span:first-child").
To make things worse, the structure changed from time to time, making the crawlers fail. After deploying my crawler to a VPS, I got errors from Cloudflare, which identified my requests as bot traffic and showed a captcha page. A couple of times I DDoSed a smaller page with too much traffic, taking it offline.
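The accidental DDoS, at least, is easy to prevent with per-domain throttling. Here's a minimal sketch (the 2-second default is an arbitrary example; pick a delay the target can handle, and respect robots.txt where it states one):

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self.last_hit = {}  # domain -> monotonic timestamp of last request

    def wait(self, url):
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_hit.get(domain, 0.0)
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_hit[domain] = time.monotonic()
```

Call `throttle.wait(url)` before every fetch; requests to different domains pass through without delay, while repeat hits to the same domain get spaced out.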
Scaling a web scraping pipeline is a topic for at least a book. So I’ll limit myself to just a few tips here, and a list of useful tools:
- most websites nowadays use some kind of API to communicate between the frontend and the backend – figure out how to become a client of that API;
- LLMs are great at extracting data from messy sources, so use them (mind the pricing though);
- always store the raw data you crawled – storage is cheap and raw data may turn out to be invaluable in the future.
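The last tip can be as simple as dumping every response to disk, keyed by URL and fetch time, before any parsing happens. A sketch (the directory layout and naming are illustrative):

```python
import hashlib
import json
import time
from pathlib import Path

def store_raw(base_dir, url, body):
    """Persist a raw response body plus minimal metadata; return the file path."""
    key = hashlib.sha256(url.encode()).hexdigest()[:16]  # stable per-URL folder
    ts = int(time.time())
    path = Path(base_dir) / key / f"{ts}.raw"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    # Keep the original URL alongside, since the hash is one-way.
    (path.parent / f"{ts}.meta.json").write_text(json.dumps({"url": url, "fetched_at": ts}))
    return path
```

When the parser breaks (and it will), you can re-run extraction over the stored files instead of re-crawling everything.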
Useful scraping tools⌗
Some tools I found useful when dealing with web-scraping tasks:
- https://www.scrapingbee.com/ – Web scraping API that simplifies data extraction from websites, including JavaScript-rendered pages.
- https://www.firecrawl.dev/ – Open-source tool that converts websites into structured data or markdown.
- https://oxylabs.io/ – Proxy services that help users gather public data while avoiding IP blocks.
- https://www.webshare.io/ – Residential and data center proxies for efficient and anonymous data access.
- https://www.zyte.com/ – Provides cloud-based web scraping tools.
- https://github.com/FlareSolverr/FlareSolverr – Open-source middleware that helps bypass anti-bot protections like Cloudflare.
- https://github.com/lwthiker/curl-impersonate – A special build of curl that impersonates browsers’ TLS signatures.
- https://github.com/lexiforest/curl_cffi – Python bindings for curl-impersonate via cffi.
Learning: Web scraping ain’t easy.
#3. Alerts⌗
It all starts innocently. A customer reports a problem. An engineer develops a fix. Someone complains about a dissatisfied client. So, with the best intentions, an alert is set up: when such-and-such happens, send a message here-and-there.
Been there, done that. I implemented tons of such alerts, believing I was doing great deeds. Until, half a year later, I’d start my day with at least a dozen alerts in a few channels. An issue with infrastructure here, a JavaScript error there, and a Rails exception in between. 95% of them required no action at all.
So my colleagues and I learned to ignore these notifications. From time to time though, a production outage would happen that raised loud alarms. But we were so used to the ever-present stream of alerts, nobody noticed them anymore.
“How did we end up here!?” someone asked at the next retrospective meeting. “We must clean this mess up!” another one replied. So we’d mute some alerts, only to add new ones a week later. Another failure, another retrospective, and the cycle repeats.
This is what I call Alert Madness. And, even though I still haven’t figured out a good solution to avoid it completely, I’m convinced that the checklist below will help me (at least) delay the pain.
Alert Sanity Checklist⌗
- add alerts only for things requiring intervention;
- be clear who is responsible for acting when the alert arrives;
- foresee how to act when the alert arrives;
- make it easy to mute irrelevant alerts;
- regularly review the alerts and remove the ones no longer necessary.
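The first four checklist items can even be enforced in code: refuse to fire any alert that doesn't name an owner and a runbook, and keep muting a one-liner. A toy sketch (the field names are made up, not from any real alerting tool):

```python
from dataclasses import dataclass, field

@dataclass
class AlertPolicy:
    """Gate alerts on the sanity checklist: actionable, owned, with a runbook."""
    muted: set = field(default_factory=set)

    def should_fire(self, alert):
        if alert["name"] in self.muted:
            return False  # easy to mute irrelevant alerts
        # Requires intervention + someone responsible + a documented way to act.
        return bool(alert.get("actionable") and alert.get("owner") and alert.get("runbook"))
```

An alert definition that can't fill in an owner or a runbook probably shouldn't exist in the first place.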
Learning: Alert fatigue is a real thing.
