A practical tour of modern AI
AI moves fast and the vocabulary moves faster. This page walks through the ideas you actually need to make good decisions - how today’s generative AI differs from older machine learning, what models and agents really are, where they run, what they cost, and the skills your team needs to use them well.
01 - The Paradigm Shift
Generative AI vs Traditional Machine Learning
Traditional machine learning is narrow and predictive. Modern generative AI is broad and creative. Both are useful, but they solve very different problems.
Traditional ML
Built and trained for one specific job - predict churn, score a loan, detect a defect on a production line. Trained on labeled examples from your business and locked to that task.
- Narrow, single-purpose models
- Outputs a number, label, or score
- Needs your data and labels to train
- Cheap to run, predictable
Generative AI
A single foundation model trained on enormous, general data that can write, summarize, translate, code, draw, and reason across thousands of tasks - often with no task-specific training at all.
- General-purpose, many tasks at once
- Outputs new text, images, audio, code
- Steered with prompts and context
- Heavier to run, can hallucinate
Rule of thumb: if the answer is a number or a category and you have clean historical data, traditional ML is often the better fit. If the answer is language, an image, a decision, or a draft, reach for generative AI.
02 - Models vs Systems
LLMs vs AI Agents
An LLM is the brain. An agent is the brain plus hands, memory, and a goal. Knowing the difference helps you scope projects realistically.
Large Language Model
A model that takes text in and produces text out. On its own it has no memory between calls, no tools, and no ability to act in the world; a chat app only appears to remember because it resends the conversation so far with each request. ChatGPT, Claude, and Grok in their basic chat form are LLMs you talk to one turn at a time.
Good at: drafting, summarizing, answering, translating, coding.
AI Agent
A system built around an LLM that can plan, call tools (search, email, your CRM, a database), remember what happened, and loop until a goal is met. An agent doesn’t just answer - it gets things done.
Good at: multi-step workflows, research, scheduling, triage, follow-ups.
Plainly: an LLM answers “what should the email say?” An agent reads the inbox, decides which message to reply to, drafts it, checks your calendar, and sends the reply - using an LLM at every step.
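The difference can be sketched in a few lines of Python. This is a minimal illustration, not a real framework: `fake_llm`, `read_inbox`, `check_calendar`, and `run_agent` are all hypothetical names, and the "model" is a hard-coded stand-in so the loop runs without any API. The point is the shape: the loop asks the model what to do next, runs the chosen tool, stores the result in memory, and repeats until the model declares the goal met or a step limit is hit.

```python
from typing import Callable

# Hypothetical tools: plain functions standing in for real integrations.
def read_inbox() -> str:
    return "From Dana: can we meet Tuesday at 10?"

def check_calendar(day: str) -> str:
    return f"{day} 10:00 is free"

TOOLS: dict[str, Callable[..., str]] = {
    "read_inbox": read_inbox,
    "check_calendar": check_calendar,
}

def fake_llm(memory: list[str]) -> dict:
    """Stand-in for a real model call: picks the next action from what has happened so far."""
    if not memory:
        return {"tool": "read_inbox", "args": {}}
    if len(memory) == 1:
        return {"tool": "check_calendar", "args": {"day": "Tuesday"}}
    return {"final": "Reply drafted: Tuesday at 10 works for me."}

def run_agent(llm: Callable[[list[str]], dict], max_steps: int = 5) -> tuple[str, list[str]]:
    """Plan -> act -> remember, looping until the model returns a final answer."""
    memory: list[str] = []
    for _ in range(max_steps):
        decision = llm(memory)
        if "final" in decision:
            return decision["final"], memory
        tool = TOOLS[decision["tool"]]
        memory.append(tool(**decision["args"]))  # remember what the tool returned
    return "Stopped: step limit reached", memory  # safety net against endless loops
```

Calling `run_agent(fake_llm)` walks through two tool calls and then returns the drafted reply. The step limit is a deliberate design choice: a loop that can act in the world should always have a way to stop.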
03 - Modalities
Text, Images, Audio, Video
A “modality” is just a type of data. Modern frontier models are increasingly multimodal - they can read, see, hear, and generate across formats in one conversation.
Text
The most mature modality. Drafting, summarizing, translation, classification, extraction, code generation, and reasoning.
Images
Generate marketing visuals, product mockups, and illustrations - or feed photos in for inspection, OCR, and visual Q&A.
Audio
Real-time speech-to-text, natural-sounding voices, voice agents, meeting transcription, and music or sound generation.
Video
The newest frontier. Short-clip generation, video understanding, scene description, and editing assistants are improving rapidly.
04 - The Currency of AI
Tokens & Frontier Model Performance
What is a token?
A token is the chunk of text a model actually reads and writes. It’s usually a short word or a piece of one. As a rough rule of thumb for English text, 1 token ≈ 4 characters, so 1,000 tokens is about 750 words.
Tokens matter because they decide three things at once: how much you can fit into a single request, how fast the response feels, and how much it costs.
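The rule of thumb above translates into a quick estimator. This is a rough sketch only: the function names are made up for illustration, and real tokenizers differ by model and language, so treat the result as a planning number rather than an exact count.

```python
import math

CHARS_PER_TOKEN = 4      # rough rule of thumb; real tokenizers vary
WORDS_PER_TOKEN = 0.75   # ~750 words per 1,000 tokens


def estimate_tokens(text: str) -> int:
    """Estimate the token count of a string from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_words(tokens: int) -> int:
    """Convert a token count into an approximate English word count."""
    return round(tokens * WORDS_PER_TOKEN)
```

For example, a 400-character paragraph comes out at about 100 tokens, and a 1,000-token budget holds roughly 750 words.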
What to expect from frontier labs
“Frontier” models are the largest, most capable systems from labs like OpenAI, Anthropic, Google, xAI, and Meta. Performance varies by model and load, but as a working baseline:
Streaming speed
50–200 tok/s
How fast text appears as it’s generated.
Time to first token
0.3–2 s
Lag before the answer starts streaming.
Cost per million tokens
$0.10–$15
Cheap small models to top-tier reasoning.
Reasoning modes that “think” before answering trade speed for quality - expect longer waits and higher token usage in exchange for noticeably better answers on hard problems.
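The baseline figures above can be combined into back-of-the-envelope estimates of wait time and spend. The sketch below is a simplification with hypothetical function names: it assumes a single flat price per million tokens, whereas providers typically price input and output tokens separately.

```python
def response_time_s(output_tokens: int, tokens_per_s: float, ttft_s: float) -> float:
    """Approximate wall-clock time: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tokens_per_s


def request_cost_usd(total_tokens: int, usd_per_million: float) -> float:
    """Approximate cost, assuming one flat rate for all tokens (a simplification)."""
    return total_tokens / 1_000_000 * usd_per_million
```

A 500-token answer streamed at 100 tok/s with a 0.5 s time to first token takes about 5.5 seconds. Two million tokens at $3 per million cost about $6.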
05 - Where AI Runs
AI Servers vs Edge Devices
AI runs in two very different places: massive data-center servers packed with specialized GPUs, and the laptop, phone, or appliance sitting in front of you. Each has real strengths.
AI Servers (Cloud GPUs)
Racks of data-center accelerators - NVIDIA H100 / H200 / Blackwell, AMD MI300, Google TPUs - with tens to hundreds of gigabytes of ultra-fast memory each, networked together to run the largest frontier models.
- Runs the biggest, smartest models
- Scales to thousands of users at once
- Pay-per-token, no hardware to own
- Data leaves your building
Edge (NPUs & Consumer GPUs)
Modern laptops and phones ship with NPUs (Neural Processing Units) for low-power AI. Consumer GPUs like the NVIDIA RTX series, and Apple Silicon chips with their unified memory, can run surprisingly capable models locally with no internet round-trip.
- Data never leaves the device
- Works offline, low latency
- No per-token bill
- Limited to small & mid-size models
Many real systems blend both: a cheap local model handles routine tasks instantly, and a frontier cloud model is called in for the hard ones.
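A blended setup needs some rule for deciding which model handles a request. The sketch below shows one deliberately naive heuristic; the keyword list, the token threshold, and the `choose_model` name are all assumptions made for illustration, and a production router would more likely use a classifier or the local model's own confidence.

```python
# Hypothetical markers of a task that probably needs a stronger model.
HARD_HINTS = ("prove", "analyze", "multi-step", "legal")


def choose_model(prompt: str, max_local_tokens: int = 2000) -> str:
    """Return "local" for short routine prompts, "cloud" for long or hard-looking ones."""
    est_tokens = len(prompt) // 4 + 1  # rough 4-characters-per-token estimate
    looks_hard = any(hint in prompt.lower() for hint in HARD_HINTS)
    if est_tokens > max_local_tokens or looks_hard:
        return "cloud"
    return "local"
```

With these defaults, "Summarize this note" stays on the local model, while "Analyze the contract" or a very long prompt is sent to the cloud.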
06 - Sizing a Model
Parameter Size & Memory
A model’s “parameters” are the numbers it learned during training. More parameters generally mean more knowledge and better reasoning - and a bigger memory footprint to run.
A simple way to estimate the RAM (or VRAM) you need: each parameter takes 2 bytes at standard 16-bit precision. So a 7-billion parameter model needs roughly 14 GB of memory just to load, plus extra headroom for the conversation itself.
| Model size | RAM at 16-bit | RAM at 4-bit | Where it fits |
|---|---|---|---|
| 1–3 B | ~2–6 GB | ~1–2 GB | Phones, NPUs, any modern laptop |
| 7–8 B | ~14–16 GB | ~4–5 GB | Mainstream laptops, mid-range GPUs |
| 13–14 B | ~26–28 GB | ~8–10 GB | Workstations, enthusiast GPUs |
| 30–34 B | ~60–70 GB | ~18–22 GB | High-end workstations, single data-center GPU |
| 70 B | ~140 GB | ~40–48 GB | Multi-GPU servers |
| 400 B+ | 800 GB+ | 200 GB+ | Frontier-lab clusters only |
Approximate values. Real usage depends on architecture, batch size, and how much context you load.
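The estimate behind these figures is simple arithmetic: parameters times bytes per parameter. The helper below is a minimal sketch with a hypothetical name; it counts the weights only and ignores the extra headroom needed for the conversation, which is likely why some 4-bit figures in the table sit above the raw result.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int = 16) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters x bytes per parameter = gigabytes
    return params_billions * bytes_per_param
```

A 7 B model at 16-bit works out to 14 GB, and a 70 B model at 4-bit to 35 GB before any overhead.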
07 - Making Models Smaller
Quantization
Quantization compresses a model by storing each parameter with fewer bits - for example, dropping from 16 bits down to 8, 4, or even fewer. The model gets dramatically smaller and usually faster, typically with only a small dip in quality at 8 or 4 bits.
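It’s the single biggest reason capable AI now runs on laptops and phones. A 70-billion parameter model that needs a multi-GPU server at 16-bit precision shrinks to roughly 40–48 GB once it’s 4-bit quantized - still too large for most single consumer GPUs, but within reach of a dual-GPU workstation or a high-memory Apple Silicon machine. Likewise, a 7–8 B model drops from ~14–16 GB to ~4–5 GB and fits on a mainstream laptop.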
As a rough guide: 8-bit is nearly indistinguishable from the original, 4-bit is the sweet spot for local use, and anything below starts to noticeably affect quality on harder tasks.
Bits per parameter
- 16-bit: full quality, full size
- 8-bit: ~50% smaller
- 4-bit: ~75% smaller
- 2–3 bit: aggressive, lossy
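To make the trade-off concrete, here is a toy version of the idea: map each weight to a small signed integer plus one shared scale factor, then map it back and measure the error. It is a simplified sketch, not how any particular tool does it; production schemes typically use per-block scales and more careful rounding, and the function names here are illustrative.

```python
def quantize(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate weights from the stored integers and the scale."""
    return [v * scale for v in q]


def max_error(weights: list[float], bits: int) -> float:
    """Largest absolute difference between original and restored weights."""
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))
```

Running `max_error` on the same weights at 8 bits and at 4 bits shows the pattern described above: fewer bits means fewer representable values, so the rounding error grows as the model shrinks.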
08 - The Model’s Working Memory
Context Windows
The context window is everything the model can “see” at once: your instructions, the conversation so far, any documents you pasted in, and the answer it’s building. It’s measured in tokens.
Small
8K tokens
~6,000 words. A short report or a few emails.
Standard
128K tokens
~300 pages. A typical business document or codebase folder.
Large
1M tokens
~2,500 pages. Entire books or a small codebase at once.
Frontier
2M+ tokens
A full library of reference material in one prompt.
Bigger is not always better. Models often pay less attention to material buried deep in a long context, and every token in the window is a token you’re paying for. Curate what you send.
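Curating can be as simple as ranking candidate material and stopping when a token budget is reached. The sketch below assumes each document already has a relevance score from somewhere (search ranking, embeddings, or a human); the scores, the 4-characters-per-token estimate, and the function names are illustrative assumptions.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token."""
    return math.ceil(len(text) / 4)


def fit_to_budget(docs: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """Greedily keep the highest-scoring documents that still fit the budget.

    `docs` is a list of (relevance_score, text) pairs.
    """
    chosen: list[str] = []
    used = 0
    for _score, text in sorted(docs, key=lambda d: d[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget_tokens:  # skip anything that would overflow
            chosen.append(text)
            used += cost
    return chosen
```

The greedy approach is not optimal packing, but it keeps the most relevant material first and keeps both the bill and the "buried in the middle" problem under control.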
09 - Working With AI
Skills to Build to Use AI Effectively
The biggest gains from AI don’t come from buying a fancier model - they come from people who know how to drive it. These are the skills worth investing in.
Prompt Engineering
Writing clear, specific instructions: give the model a role, show examples, state the format you want, and tell it what to do when it’s unsure. The single highest-leverage AI skill.
Context Curation
Knowing what to put in the prompt and what to leave out. Feeding the right document, the right examples, and the right constraints - not everything you have.
Verification & Critical Reading
Treat AI output as a confident first draft, not a final answer. Spot hallucinations, check sources, and never publish anything you haven’t read.
Model & Tool Selection
Knowing when to use a small fast model, when to spend on a reasoning model, and when an agent or workflow is a better tool than a chat window.
Workflow Design
Breaking real work into steps an AI can do well, with humans in the loop at the right moments. The skill that turns a clever demo into a real productivity gain.
Privacy & Data Hygiene
Knowing what is safe to paste into which tool, when to use a private model, and how to handle customer data responsibly.
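As a concrete illustration of the prompt-engineering advice above (give a role, show examples, state the format, say what to do when unsure), a reusable template might look like the sketch below. The `build_prompt` helper and its wording are assumptions for illustration, not a prescribed format.

```python
def build_prompt(
    role: str,
    task: str,
    examples: list[tuple[str, str]],
    output_format: str,
) -> str:
    """Assemble a prompt with a role, worked examples, a format, and a fallback rule."""
    lines = [f"You are {role}.", "", f"Task: {task}", ""]
    for given, wanted in examples:
        lines += [f"Example input: {given}", f"Example output: {wanted}", ""]
    lines += [
        f"Respond in this format: {output_format}",
        "If you are unsure or information is missing, say so instead of guessing.",
    ]
    return "\n".join(lines)
```

Calling `build_prompt("a support agent", "Classify the ticket", [("App crashes", "bug")], "one word")` produces a prompt that covers all four elements, and keeping the template in code makes it easy to review and reuse across a team.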