AI doesn't get programmed. It gets trained.
Regular software is a list of rules a person wrote down: if this, then that. Modern AI works almost the opposite way. Nobody sits down and writes the rule for "what makes a good sentence" or "how to spot sarcasm." Instead, the system is shown a staggering number of examples and left to work out the patterns on its own.
It happens in roughly three stages. First comes pretraining. The model reads an enormous slice of the internet, books, and code, and learns to do one boring-sounding thing extremely well: predict the next word (strictly speaking, the next token, which is a word or a fragment of one). Do that across trillions of words and something strange emerges: fluency. The model can write, summarize, and answer. But raw fluency has no judgment. It will confidently say something wrong in exactly the same tone it uses to say something right.
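To make "predict the next word" concrete, here is a toy sketch in Python. It is not how a real model works internally (those use neural networks over tokens, not word counts); it only shows the shape of the task. The function names and the tiny corpus are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Made-up training text; a real model sees trillions of words, not three sentences.
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."
model = train_bigram(corpus)

print(predict_next(model, "the"))    # the word seen most often after "the"
print(predict_next(model, "zebra"))  # never seen in training, so no prediction
```

Even this crude counter shows the point made above: it produces a plausible next word with no notion of whether the result is true.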
Second comes the human part. People sit down with the half-trained model and teach it manners and judgment: they write example answers, rank two responses to say which is better, flag mistakes, and steer it away from toxic or dangerous output. This stage, often called fine-tuning and reinforcement learning from human feedback (RLHF), is where a fast-talking text generator becomes something that feels helpful, careful, and honest. Third comes ongoing refinement, as new feedback rolls in long after launch.
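The "rank two responses" step can be sketched in a few lines. What follows is a simplified stand-in, assuming a Bradley-Terry style model in which each answer gets a single score and the chance that one answer beats another depends on the score gap. Real pipelines train a neural reward model on text rather than one number per answer; the labels and names below are hypothetical.

```python
import math

def fit_scores(comparisons, steps=200, lr=0.1):
    """Fit one score per answer from (winner, loser) pairs.

    Assumes P(a beats b) = sigmoid(score[a] - score[b]); each update nudges
    the scores so the observed human choices become more likely.
    """
    scores = {name: 0.0 for pair in comparisons for name in pair}
    for _ in range(steps):
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            # Push the winner up and the loser down, more strongly when
            # the current scores made this outcome look unlikely.
            step = lr * (1.0 - p_win)
            scores[winner] += step
            scores[loser] -= step
    return scores

# Hypothetical rater decisions: (answer preferred, answer rejected).
human_rankings = [
    ("careful", "overconfident"),
    ("careful", "rude"),
    ("overconfident", "rude"),
    ("careful", "overconfident"),
]

scores = fit_scores(human_rankings)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # answers ordered from most to least preferred
```

The takeaway is that thousands of small "this one is better" clicks add up to a numeric signal the model can be trained against.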
The invisible humans in the loop
So who are these people? Mostly, you'll never hear about them. Researchers call them "ghost workers" — a vast, largely contract workforce scattered across the globe that labels the data and grades the answers that AI learns from. Their day looks like this: tagging objects in images, transcribing audio, sorting text into categories, comparing two AI replies and marking the better one, writing model answers from scratch, and — the grimmest job — reviewing violent or abusive content so a filter can learn to block it.
The scale is hard to picture. The World Bank estimates that somewhere between 154 million and 435 million people do online gig work of some kind. The International Labour Organization counted online labor platforms jumping from around 142 in 2010 to more than 777 by 2020, with the number of people earning on them rising from roughly 43 million in 2018 to about 78 million by 2023 — and those figures miss the millions more hired through subcontractors. A big share of that hidden army feeds AI.
Most of this work is routed to countries with lower wages and lighter labor rules: Kenya, the Philippines, India, Venezuela, Colombia, Pakistan, Nigeria, and increasingly China. The pay is the part that tends to stop people cold. Across much of the Global South, basic labeling pays roughly $1 to $2 an hour; piece work can mean a few cents per task. One widely reported case found that workers in Kenya reviewing toxic content for a major AI company took home somewhere around $1.32 to $2 an hour. A Venezuelan worker described earning between five and twenty-five U.S. cents per task, juggling five platforms at once from home. The same labeling that might cost $15 an hour in the United States becomes "economically viable" at a tenth of that, about $1.50 an hour, elsewhere. That is geographic arbitrage, with humans as the commodity.
The companies arranging this are large and mostly invisible to the public: Scale AI (valued at roughly $14 billion), Appen, Sama, iMerit, and labeling marketplaces like Remotasks, DataAnnotation, and Outlier. They tend to call themselves technology platforms rather than employers, which is convenient when questions about wages and conditions come up. When workers in Kenya organized a union to push for better terms, projects reportedly drifted away from the unionized providers.
Here's the twist, though: as models got smarter, the work started to change. The newest models don't need help telling a cat from a dog anymore — they need experts who can catch a subtle medical error, untangle a contract, or judge a tricky bit of reasoning. So a second tier of this market has appeared, and it pays real money. Generalist gigs on sites like DataAnnotation start around $20 an hour; specialist projects start near $40. Domain experts command consultant rates — reportedly $150–$300 an hour for doctors labeling radiology data, $200–$400 for lawyers annotating contracts, $100–$250 for finance pros. The "sweatshop data" era isn't over. But a more skilled, better-paid layer is growing on top of it.
The best data isn't on the internet
Now the part that reframes everything. Public AI learned from the open web — and the open web, for all its size, is missing the most valuable knowledge on earth. The truly useful data almost never gets published. It sits locked inside private networks: hospital records and scans, bank transaction logs, insurance claim files, legal case archives, factory machine telemetry, and the internal documents of every serious company.
The volume is almost comical. One 2025 estimate put the world's enterprise-generated unstructured text alone at over 2.5 zettabytes, with healthcare on its own producing something like 120 petabytes a year of clinical notes, imaging, and genomic data. A general-purpose AI has seen essentially none of it. It can't — that data lives behind logins, privacy law, and compliance walls it was never given a key to.
This is why a public chatbot can sound brilliant about medicine in general and still know nothing about how your hospital actually treats a condition. It read the textbooks. It never saw the charts.
The smarter AI you'll never meet
Inside those walls, something quietly powerful is happening. Companies are building specialized models — sometimes called small or domain-specific language models — fine-tuned on their own private data. In their narrow lane, these specialists routinely beat the giant general-purpose models everyone talks about.
A model trained on a single factory's sensor history can predict equipment failures with an accuracy no outside AI could match, because no outside AI ever saw that data. Healthcare systems fine-tune models on internal patient records to read scans and notes faster and more accurately than a generalist. Purpose-built medical models and a hospital's own internal diagnostic tools understand clinical context that trips up a general chatbot. In a 2025 enterprise survey, more than two-thirds of organizations that deployed specialized models reported better accuracy and faster returns than with general-purpose AI.
This is what people in the industry mean by a "data moat" — your proprietary data is the "crown jewels," the one thing competitors can't copy. Some now call private domain data "the new gold." And the uncomfortable implication is this: the single smartest AI about your exact medical condition, your industry, or your legal situation may already exist — behind a door you'll never be allowed to open. It doesn't live on a website. It lives inside a bank, a hospital, or a defense contractor.
The doors in the wall are called APIs
Walls have doors, though — and in software, the door is an API. An API is a controlled gateway that lets one system ask another for specific data or actions without handing over the whole vault. The bank doesn't give you its database; it gives you a narrow, permissioned window. That distinction is the entire game.
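A minimal sketch of that "narrow, permissioned window," in Python. Everything here is hypothetical (the class, the token scheme, the sample accounts); a real bank API involves authentication standards, auditing, and much more. The point is only the shape: the caller gets one method, not the database.

```python
class AccountGateway:
    """Toy API layer: callers use get_balance() and never touch the vault itself."""

    def __init__(self, vault, tokens):
        self._vault = vault    # stands in for the private database
        self._tokens = tokens  # maps an access token to the one account it may read

    def get_balance(self, token, account_id):
        """Return a single balance, and only if the token is scoped to that account."""
        if self._tokens.get(token) != account_id:
            raise PermissionError("token is not authorized for this account")
        return self._vault[account_id]["balance"]

# Hypothetical data: the vault holds more than the gateway ever exposes.
gateway = AccountGateway(
    vault={
        "acct-1": {"balance": 120.5, "notes": "internal only"},
        "acct-2": {"balance": 9.75, "notes": "internal only"},
    },
    tokens={"token-alice": "acct-1"},
)

print(gateway.get_balance("token-alice", "acct-1"))  # allowed: token matches account

try:
    gateway.get_balance("token-alice", "acct-2")     # refused: wrong account
except PermissionError as err:
    print("refused:", err)
```

Note what the gateway leaves out: there is no method for listing accounts or reading the internal notes, so that data stays behind the wall no matter who asks.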
Here's the practical secret most people miss: a huge amount is already reachable, if you know which API holds it. Government open-data portals, company filings through systems like the SEC's EDGAR, mapping and geolocation, weather, financial market feeds, scientific and medical reference databases — all sitting behind public or semi-public APIs, waiting for someone to ask correctly.
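As one illustration of "asking correctly," here is a sketch that prepares a request to the SEC's EDGAR data service without sending it. It assumes the submissions endpoint follows the pattern shown (a company's CIK number zero-padded to ten digits) and that the SEC expects automated clients to identify themselves in the User-Agent header; check the SEC's current developer documentation before relying on either. The function name and contact string are made up for the example, and 320193 is used as Apple's CIK.

```python
import urllib.request

def build_edgar_request(cik, contact):
    """Prepare (but do not send) a request for one company's filing history."""
    # Assumed URL pattern: the CIK is zero-padded to ten digits.
    url = f"https://data.sec.gov/submissions/CIK{cik:010d}.json"
    # Identify the caller, as the SEC asks of automated clients.
    return urllib.request.Request(url, headers={"User-Agent": contact})

request = build_edgar_request(320193, "Example Reader reader@example.com")
print(request.full_url)
```

Sending it is one more line, `urllib.request.urlopen(request)`, followed by parsing the JSON that comes back. The hard part is not the code. It is knowing that this door exists and what it expects from a caller.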
So the quietly valuable skill of the next few years isn't "knowing everything." It's knowing which door holds the answer and how to knock. An AI on its own is a brilliant generalist with a fixed memory. An AI pointed at the right API becomes something far more useful: current, specific, and grounded in real data instead of guesswork. For any question that turns on fresh or specific facts, AI plus the right gateway beats AI alone.
So does this get better — or worse?
Honestly, both at once. Two forces are pulling in opposite directions.
The case for worse: data moats reward hoarding. The more valuable a company's private data, the more incentive it has to lock the smartest AI inside its own walls and never let the public near it. The cheapest labeling work stays cheap and stays hidden, with companies sidestepping the workers who try to organize for better terms. The most capable AI in any field could end up being the one fewest people can touch.
The case for better: regulators are starting to pull back the curtain — new transparency rules in Europe are pushing the hidden labor of AI into the light. The expert tier of training work is paying genuine professional wages, which is a real shift in dignity and money. APIs keep multiplying, opening more of those locked doors to anyone who learns the knock. And the general models are converging fast — by early 2025 the top ten foundation models were clustering within about five points of one another on the common benchmarks, which means raw "general intelligence" is starting to behave like a commodity. As that floor rises, the advantage of simply gatekeeping a model narrows.
The likely outcome is a strange split: the floor keeps rising — everyone gets a capable general AI almost for free — while the ceiling gets locked higher, with the truly specialized systems sealed inside private networks. The gap between "the AI everyone has" and "the AI that actually knows your world" becomes one of the defining lines of the next decade.
The takeaway
Next time an AI answer feels like magic, remember what it really is: a stack of human choices. The data someone decided to collect. The example someone labeled for two dollars. The expert who graded its reasoning for two hundred. The door someone left open as an API — or quietly bolted shut. The intelligence is real. It just isn't magic, and it was never only the machine's.