Every few months I find myself explaining the same thing to a friend: no, “AI” isn’t one monolithic thing — it’s a loose family of very different models, each with its own quirks, training story, and reason it works. So I finally sat down and wrote the tour I wish someone had handed me. Here it is.
For each family you’ll get a sense of what it does, the algorithm underneath, why that algorithm is a good fit for the problem, how painful it is to train, and roughly how much compute you need to throw at it.
1. Large Language Models (LLMs)
You already know the names: GPT-4, Claude, Llama, Gemini, Mistral. What they’re all doing, at the bottom, is almost embarrassingly simple — given a stream of tokens (think word-pieces), predict the next one. Scale that single objective up enough and the same model that started as a next-token predictor ends up writing code, summarizing papers, translating, and holding a conversation.
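To make that concrete, here's a minimal sketch of the next-token objective in PyTorch. The tiny embedding-plus-linear "model" is just a stand-in for a real decoder-only Transformer, and the token ids are random:

```python
import torch
import torch.nn.functional as F

# Toy illustration of the next-token objective (not any specific model).
# tokens: a batch of token-id sequences, e.g. from a tokenizer.
tokens = torch.randint(0, 1000, (2, 16))          # (batch, seq_len), vocab of 1000
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t+1 from tokens up to t

# Stand-in for a decoder-only Transformer: any model mapping ids -> logits works here.
model = torch.nn.Sequential(
    torch.nn.Embedding(1000, 64),
    torch.nn.Linear(64, 1000),
)

logits = model(inputs)                            # (batch, seq_len - 1, vocab)
loss = F.cross_entropy(logits.reshape(-1, 1000), targets.reshape(-1))
loss.backward()                                   # everything else is scale
```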
In practice you run into LLMs all day without naming them. The chatbot that drafts your reply, the autocomplete suggesting the next line of code, the “summarize this thread” button in your email client, the customer-support agent that answers before a human ever sees the ticket, the tool that turns a messy PDF into structured JSON — all the same machinery underneath. The interesting shift of the last couple of years is that LLMs stopped being just text-in, text-out: wire one up to tools and it plans a multi-step task, call it in a loop and it becomes an agent, hand it a codebase and it refactors across files.
But “LLM” hides an important split: large language models and small language models are different beasts, and most of the confusion in the field comes from conflating them. A frontier model — think 100B+ parameters, served from someone else’s data center — is a generalist. It handles messy instructions, obscure domains, long chains of reasoning, and it does so at a latency and cost that only make sense when the task is genuinely hard. A small language model — 1B to 8B parameters, the kind that runs on a laptop or even a phone — is a specialist. It’s faster, cheaper, private (the data never leaves the device), and when you fine-tune it on your specific task it often matches or beats the frontier model on that one task.
The practical rule I’ve landed on: reach for a large model when the problem is open-ended or you don’t yet know what “good” looks like, and reach for a small model when the task is narrow, repeated millions of times, or has to run somewhere a GPU cluster can’t. A lot of production systems end up using both — a big model to prototype and generate training data, a small fine-tuned model to actually serve traffic.
2. Computer Vision
Vision is where deep learning really convinced everyone it was going to work, back in the ImageNet days. It’s also where the field has churned through the most architectures, so let’s split it up.
2a. Convolutional Neural Networks (CNNs)
ResNet, EfficientNet, ConvNeXt. The classic image workhorses. You use them to classify images, to extract features, or as the backbone for detection and segmentation.
A CNN is a stack of convolution layers that learn local filters — edges first, then textures, then object parts, then whole objects. Pooling shrinks the resolution as you go up, and residual connections (the “Res” in ResNet) keep gradients flowing even when the network gets very deep.
Two priors are baked in, and they’re the reason CNNs work so well on so little data: locality (nearby pixels matter more than distant ones) and translation equivariance (a cat is a cat whether it’s in the top-left or the bottom-right). A fully-connected net has to learn both of those from scratch.
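Here's a bare-bones residual block in PyTorch, a sketch that assumes the input and output channel counts match so the skip connection is a plain addition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Bare-bones residual block: two 3x3 convs plus a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)   # the skip connection that keeps gradients flowing

block = ResidualBlock(64)
features = block(torch.randn(1, 64, 32, 32))   # (batch, channels, height, width)
```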
Training is moderate difficulty. A standard ResNet-50 on ImageNet finishes in hours to days on a handful of GPUs, and good data augmentation (crops, flips, color jitter, MixUp) tends to matter more than people expect. Compute-wise, this is tiny compared to LLMs — ResNet-50 training is around 10¹⁸ FLOPs, and fine-tuning a pretrained CNN runs on a single consumer GPU.
2b. Vision Transformers (ViT)
ViT, DeiT, Swin, DINOv2. Same jobs as CNNs, but the architecture is a Transformer. You chop the image into patches (say 16×16 pixels), linearly embed each patch, add positional embeddings, and feed the resulting sequence into a standard Transformer encoder.
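A minimal sketch of that patchify-and-embed step in PyTorch (the sizes are illustrative; a strided convolution is a standard way to cut and embed patches in one shot):

```python
import torch
import torch.nn as nn

# Patchify-and-embed sketch (illustrative sizes; real ViTs vary).
image = torch.randn(1, 3, 224, 224)                  # (batch, channels, H, W)
patch, dim = 16, 768

# One 16x16 kernel per patch, no overlap: each patch becomes one embedding.
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
x = to_patches(image).flatten(2).transpose(1, 2)     # (1, 196, 768): a sequence of patch tokens

pos = nn.Parameter(torch.zeros(1, x.shape[1], dim))  # learned positional embeddings
x = x + pos                                          # ready for a standard Transformer encoder
```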
Attention across patches captures global context much earlier in the network than stacked convolutions ever could. The trade-off, and the rule of thumb I use: with enough data ViTs beat CNNs, while on small datasets the CNN's built-in priors still win.
ViTs are more finicky to train than CNNs: they're data-hungry, and less stable unless you lean on cosine schedules, stochastic depth, and strong augmentation. Self-supervised pretraining (DINO, MAE) is what unlocks them on smaller datasets. Compute is comparable to large CNNs at standard sizes, though the frontier vision models — DINOv2, SAM — have grown past a billion parameters and want a serious cluster.
2c. Object Detection and Segmentation
YOLO (v8/v9/v10), DETR, Mask R-CNN, Segment Anything. These locate objects — either with bounding boxes (detection) or pixel-precise masks (segmentation).
The standard pattern is backbone plus head. A backbone (CNN or ViT) extracts general visual features; a task-specific head turns those into boxes, class labels, and masks. DETR uses attention-based set prediction; YOLO uses dense grid predictions tuned for speed. Both approaches work; they just optimize for different trade-offs.
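As a rough sketch of the backbone-plus-head idea, here's a toy YOLO-style dense head in PyTorch: one 1×1 convolution that predicts class scores and box offsets for every cell of the backbone's feature map. Sizes are made up for illustration.

```python
import torch
import torch.nn as nn

# Toy dense detection head: for each cell of the feature map, predict
# class scores plus 4 box offsets.
num_classes, channels = 80, 256
backbone_features = torch.randn(1, channels, 20, 20)   # output of a CNN/ViT backbone

head = nn.Conv2d(channels, num_classes + 4, kernel_size=1)
pred = head(backbone_features)                          # (1, 84, 20, 20)

cls_logits = pred[:, :num_classes]     # classification branch
box_offsets = pred[:, num_classes:]    # localization branch; the loss-balancing pain lives here
```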
The pain here isn’t the algorithm, it’s the data. Every object needs a box or a mask, and that labeling is expensive. Loss balancing between classification and localization is also finicky enough to eat a week of your life if you’re not careful. On the compute side, training YOLO on COCO takes hours on a handful of GPUs; SAM, with its billions of masks, was a much bigger lift.
2d. Diffusion Models (Image Generation)
Stable Diffusion, DALL-E 3, Midjourney, Flux. The ones generating images from text prompts.
The idea is almost unreasonably elegant. You take a real image, add Gaussian noise to it in small steps until it’s pure noise, and train a network to reverse that process — predict the noise that was just added. At inference you start from pure noise and iteratively denoise, usually guided by a text embedding from something like CLIP or T5. Turning one hard generation problem into a sequence of many easy denoising problems is the whole trick, and classifier-free guidance is the dial you turn to trade diversity against prompt fidelity.
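Here's one training step in a simplified DDPM-style sketch. The schedule values and the placeholder denoiser are illustrative, not any particular model's settings:

```python
import torch
import torch.nn.functional as F

# One illustrative diffusion training step (simplified DDPM-style; real schedules vary).
images = torch.randn(8, 3, 64, 64)                      # a batch of real images
T = 1000
betas = torch.linspace(1e-4, 0.02, T)                   # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

t = torch.randint(0, T, (images.shape[0],))             # random timestep per image
noise = torch.randn_like(images)
a = alphas_bar[t].view(-1, 1, 1, 1)
noisy = a.sqrt() * images + (1 - a).sqrt() * noise      # forward process: add noise

# `denoiser` stands in for a U-Net/DiT conditioned on the timestep (and a text embedding).
denoiser = lambda x, t: torch.zeros_like(x)             # placeholder for illustration
loss = F.mse_loss(denoiser(noisy, t), noise)            # learn to predict the added noise
```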
Training is expensive. You need LAION-scale paired image-text data (billions of pairs), careful noise schedules, and a long run. Stability is better than the GAN era but hyperparameters still bite. Stable Diffusion-scale training is in the hundreds of thousands of dollars; frontier commercial models are in the millions. Inference fits on a single GPU but the 20–50 denoising steps make it slower than a classifier.
3. Speech
3a. Speech-to-Text (ASR)
Whisper, wav2vec 2.0, Conformer, Canary. These turn audio into text — often with timestamps and language detection thrown in.
The pipeline is audio → spectrogram (or raw waveform features) → Transformer or Conformer (that’s a Transformer with convolution blocks mixed in) → text. The decoding head is typically CTC, an encoder-decoder with cross-attention, or an RNN-transducer; each makes a different trade-off between streaming, accuracy, and latency.
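A minimal sketch of the CTC piece in PyTorch, with made-up shapes standing in for the encoder output and transcripts:

```python
import torch
import torch.nn as nn

# Sketch of the CTC decoding head (shapes are illustrative).
T, batch, vocab = 200, 4, 32                 # 200 encoder frames; index 0 reserved for the blank
encoder_out = torch.randn(T, batch, vocab)   # what a Transformer/Conformer encoder might emit
log_probs = encoder_out.log_softmax(dim=-1)

targets = torch.randint(1, vocab, (batch, 20))            # character/BPE ids of the transcript
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 20, dtype=torch.long)

# CTC marginalizes over all alignments of the short transcript to the long audio.
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```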
Spectrograms are the clever move: they make speech look like a 2D image over time and frequency, which both Transformers and convolutions already know how to handle. Whisper’s other clever move was training on 680k hours of weakly-labeled audio scraped from the web, which is where its robustness to accents and noise comes from.
Labeled speech data is scarce, so in practice you either do self-supervised pretraining (wav2vec-style) or weak supervision (Whisper-style). Aligning long audio with long transcripts is its own small nightmare. Whisper-large took weeks on hundreds of GPUs. Inference, though, is cheap — Whisper-base runs in real time on a laptop CPU, which is quietly one of the best deals in ML.
3b. Text-to-Speech (TTS)
ElevenLabs, Tacotron 2, VALL-E, XTTS, Bark. These go the other direction: text in, natural-sounding speech out, sometimes in a target voice.
Traditionally there are two stages — an acoustic model that turns text into a mel-spectrogram, and a vocoder (HiFi-GAN, WaveNet) that turns the spectrogram into a waveform. The modern twist, VALL-E and friends, reframes TTS as language modeling over discrete audio tokens from a neural codec like Encodec or SoundStream. Once your audio is tokens, the whole LLM playbook comes with it.
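A skeleton of that two-stage decomposition, with placeholder modules standing in for the acoustic model and vocoder. Nothing here is a real model's API; it just shows the shapes flowing through:

```python
import torch
import torch.nn as nn

# Two-stage TTS skeleton (placeholders, not any real model's interface).
class AcousticModel(nn.Module):
    """Stands in for a Tacotron-style text -> mel-spectrogram model."""
    def forward(self, phoneme_ids):                     # (batch, text_len)
        frames = phoneme_ids.shape[1] * 5               # rough upsampling, for illustration only
        return torch.randn(phoneme_ids.shape[0], 80, frames)   # (batch, mel_bins, frames)

class Vocoder(nn.Module):
    """Stands in for a HiFi-GAN-style mel-spectrogram -> waveform model."""
    def forward(self, mel):
        hop = 256                                       # audio samples per mel frame
        return torch.randn(mel.shape[0], mel.shape[2] * hop)

phonemes = torch.randint(0, 100, (1, 50))
waveform = Vocoder()(AcousticModel()(phonemes))         # (1, 64000): roughly 4 s at 16 kHz
```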
The reason decomposing helps is that each intermediate representation — phonemes, spectrograms, audio tokens — turns a continuous, high-dimensional problem into a more tractable one. Training is moderate difficulty; the catch is that high-quality voice data is expensive to produce, and nailing prosody and emotion is still an open research problem. Compute is much smaller than LLMs — a single multi-GPU box is enough for many open models, and real-time inference is standard.
3c. Speaker and Audio Classification
Speaker verification (ECAPA-TDNN), wake-word detection, audio event classification (YAMNet). Smaller, quieter models, usually a CNN or Transformer over spectrograms, trained with contrastive or classification losses. Training is easy-to-moderate on datasets like VoxCeleb and AudioSet. Compute is small enough that these often run on-device — your phone and smart speaker are running this category.
4. Multimodal Models
CLIP, GPT-4V, Claude, Gemini, LLaVA, Flamingo. Models that handle more than one modality — usually images plus text, sometimes audio and video too.
There are two recipes worth knowing. The contrastive recipe (CLIP) trains an image encoder and a text encoder side-by-side so that matching image-caption pairs end up with high cosine similarity and non-matching pairs don’t. The generative recipe (GPT-4V, LLaVA) takes features from a vision encoder, projects them into an LLM’s embedding space, and lets the LLM attend to them as if they were just more tokens.
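The contrastive recipe fits in a few lines. Here's a minimal sketch with random embeddings standing in for the two encoders (the 0.07 temperature is a common choice, not gospel):

```python
import torch
import torch.nn.functional as F

# CLIP-style contrastive objective (minimal sketch; the encoders are placeholders).
batch, dim = 8, 512
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # from an image encoder
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)    # from a text encoder

temperature = 0.07
logits = image_emb @ text_emb.t() / temperature            # cosine similarities, (batch, batch)

# Matching pairs sit on the diagonal; everything off-diagonal is a negative.
labels = torch.arange(batch)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```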
Both recipes exploit the same observation: images and their descriptions carry overlapping information, so forcing them into a shared space (or a shared token stream) lets a model generalize across modalities with surprisingly weak alignment data. CLIP needed 400M image-text pairs. Large multimodal LLMs bolt a vision instruction-tuning stage onto an already-expensive base LLM, so you’re paying twice.
5. Recommendation Systems
YouTube’s two-tower model, TikTok’s ranker, Netflix’s recommenders. The models that decide what you see next.
The usual shape is a two-tower or multi-tower deep network — one tower embeds the user, another the item — with a dot product (plus some re-ranking) producing the score. You train it on billions of click or watch events with binary cross-entropy or ranking losses. Embeddings end up capturing latent preferences and item properties that hand-engineered features miss, and the sheer volume of interaction data makes up for a pretty simple objective.
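A minimal two-tower sketch in PyTorch, with toy id ranges, random interactions, and a dot-product score trained with binary cross-entropy:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal two-tower sketch: toy sizes, random ids, binary click labels.
num_users, num_items, dim = 10_000, 50_000, 64

user_tower = nn.Sequential(nn.Embedding(num_users, dim), nn.Linear(dim, dim))
item_tower = nn.Sequential(nn.Embedding(num_items, dim), nn.Linear(dim, dim))

user_ids = torch.randint(0, num_users, (256,))
item_ids = torch.randint(0, num_items, (256,))
clicked = torch.randint(0, 2, (256,)).float()              # did the user click/watch?

score = (user_tower(user_ids) * item_tower(item_ids)).sum(dim=-1)   # dot product per pair
loss = F.binary_cross_entropy_with_logits(score, clicked)
```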
The algorithmic side is moderate. The infrastructure side is where it gets serious: logging pipelines, feature stores, online learning, A/B testing, freshness guarantees, data skew. Training compute per model is smaller than LLMs, but companies run dozens of these and retrain continuously, so the aggregate is enormous.
6. Reinforcement Learning
AlphaGo, AlphaZero, OpenAI Five, and the RLHF post-training step that turns a raw LLM into something you can talk to. RL learns policies that maximize reward in sequential decision problems.
You have a policy network that picks actions, a value network that estimates future reward, and a training loop that runs policy gradients (PPO), value-based methods (DQN), or self-play for two-player games. When you can simulate the environment cheaply (games) or get a reliable reward signal (human preferences), RL can discover strategies no human ever labeled.
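Here's the PPO clipped surrogate in isolation, with toy tensors standing in for what a real rollout buffer would provide (old and new log-probabilities, advantages):

```python
import torch

# The PPO clipped surrogate on its own (toy tensors; a real loop collects
# rollouts, computes advantages with GAE, and adds value and entropy terms).
new_log_probs = torch.randn(64, requires_grad=True)   # log pi_new(a|s) for sampled actions
old_log_probs = torch.randn(64)                        # log pi_old(a|s), no gradient
advantages = torch.randn(64)                           # how much better the action was than expected

ratio = (new_log_probs - old_log_probs).exp()
clipped = torch.clamp(ratio, 0.8, 1.2)                 # clip range epsilon = 0.2
policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
policy_loss.backward()
```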
The reputation for being hard is earned. Rewards are sparse, exploration is hard, training is unstable, and results swing with the random seed. Compute varies wildly — AlphaZero burned through thousands of TPUs, while RLHF for an LLM is cheap relative to pretraining, which is still a lot of compute by anyone else’s standards.
7. Graph Neural Networks (GNNs)
GCN, GraphSAGE, GAT, and the Evoformer inside AlphaFold (graph-ish). These live on graph-structured data — molecules, social networks, knowledge graphs, transaction networks.
The core loop is message passing: each node updates its representation by aggregating from its neighbors. Pick a different aggregation (sum, mean, attention) and you get a different GNN variant. Explicit graph structure is a strong inductive bias whenever relationships matter more than absolute identity, which is exactly the situation in drug discovery, fraud detection, and recommendations.
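Here's one round of message passing written out with a dense adjacency matrix for clarity: a GraphSAGE-flavored sketch with mean aggregation. Real libraries use sparse operations and neighbor sampling instead.

```python
import torch

# One round of mean-aggregation message passing (GraphSAGE-flavored sketch).
num_nodes, dim = 5, 8
x = torch.randn(num_nodes, dim)                        # node features
edges = torch.tensor([[0, 1], [1, 2], [2, 0], [3, 4]]) # (src, dst) pairs

adj = torch.zeros(num_nodes, num_nodes)
adj[edges[:, 1], edges[:, 0]] = 1.0                    # each dst aggregates from its srcs
deg = adj.sum(dim=1, keepdim=True).clamp(min=1)

neighbor_mean = (adj @ x) / deg                        # aggregate neighbor features
W = torch.nn.Linear(2 * dim, dim)
x_new = torch.relu(W(torch.cat([x, neighbor_mean], dim=-1)))  # update each node
```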
Training is moderate. Scaling to billion-edge graphs forces you into careful neighborhood sampling. Most GNN research still fits on a single GPU.
8. Time-Series and Forecasting
Classical (ARIMA, Prophet) and deep (N-BEATS, TimesFM, Chronos). Forecasting future values — demand, prices, weather, sensors.
Classical models fit parametric forms directly. Deep models use Transformers or specialized architectures that tokenize continuous values or use patch-based inputs. What’s interesting is that large pretrained time-series foundation models — Chronos, TimesFM, Moirai — are starting to show that one model can forecast many unseen series zero-shot, which is exactly the LLM trick applied to numbers.
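A sketch of the patch-based input trick (PatchTST-style), turning a continuous series into a token sequence a Transformer can forecast over:

```python
import torch

# Patch-tokenizing a continuous series (sizes are illustrative).
series = torch.randn(1, 512)             # one univariate series, 512 timesteps
patch_len, stride = 16, 16
patches = series.unfold(1, patch_len, stride)   # (1, 32, 16): non-overlapping windows

embed = torch.nn.Linear(patch_len, 128)
tokens = embed(patches)                   # (1, 32, 128): a token sequence for a Transformer
```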
Classical methods fit in seconds on a CPU. The new foundation models pretrain on large archives of time series and want a GPU cluster, though still far less than LLMs do.
Quick Comparison
| Family | Core algorithm | Training difficulty | Typical compute |
|---|---|---|---|
| LLMs | Decoder Transformer | Very high | 10²⁴–10²⁶ FLOPs; thousands of GPUs |
| CNNs | Stacked convolutions | Low–moderate | Single to handful of GPUs |
| Vision Transformers | Patch Transformer | Moderate | Handful to hundreds of GPUs |
| Object detection | Backbone + heads | Moderate–high | Handful of GPUs |
| Diffusion (image gen) | Denoising U-Net/DiT | High | Hundreds of GPUs |
| Speech-to-text | Transformer/Conformer + CTC | Moderate–high | Handful to hundreds of GPUs |
| Text-to-speech | Acoustic model + vocoder or codec LM | Moderate | Single multi-GPU box |
| Multimodal (CLIP, VLMs) | Dual encoders or VLM | High | Large clusters |
| Recommenders | Two-tower deep nets | Moderate (infra-heavy) | Modest per model, massive in aggregate |
| Reinforcement learning | Policy gradient / self-play | Very high | Varies wildly; can be huge |
| GNNs | Message passing | Moderate | Single GPU usually |
| Time-series | Stats or patch Transformer | Low–moderate | CPU to small cluster |
Patterns that keep showing up
Step back and a few threads run through all of this. The Transformer has quietly colonized vision, speech, and even time-series — the same attention mechanism keeps outperforming domain-specific alternatives the moment you have enough data. Self-supervised or weakly-supervised pretraining (MAE for images, wav2vec for audio, next-token for text) consistently beats training from scratch on labeled data. And the scaling-laws mindset — predictable gains from more data, parameters, and compute — has spread from LLMs into speech, vision, and time-series foundation models.
Which means the practical question, if you’re picking a model today, is almost never “which architecture.” It’s “which pretrained checkpoint do I fine-tune.” That shift — from building models to adapting them — is probably the most important thing to internalize from all of this.