A practical tour of modern AI

AI moves fast and the vocabulary moves faster. This page walks through the ideas you actually need to make good decisions - how today’s generative AI differs from older machine learning, what models and agents really are, where they run, what they cost, and the skills your team needs to use them well.


01 - The Paradigm Shift

Generative AI vs Traditional Machine Learning

Traditional machine learning is narrow and predictive. Modern generative AI is broad and creative. Both are useful, but they solve very different problems.

Traditional ML

Built and trained for one specific job - predict churn, score a loan, detect a defect on a production line. Trained on labeled examples from your business and locked to that task.

Narrow, single-purpose models
Outputs a number, label, or score
Needs your data and labels to train
Cheap to run, predictable

Generative AI

A single foundation model trained on enormous, general data that can write, summarize, translate, code, draw, and reason across thousands of tasks - often with no task-specific training at all.

General-purpose, many tasks at once
Outputs new text, images, audio, code
Steered with prompts and context
Heavier to run, can hallucinate

Rule of thumb: if the answer is a number or a category and you have clean historical data, traditional ML is often the better fit. If the answer is language, an image, a decision, or a draft, reach for generative AI.

02 - Models vs Systems

LLMs vs AI Agents

An LLM is the brain. An agent is the brain plus hands, memory, and a goal. Knowing the difference helps you scope projects realistically.

Large Language Model

A model that takes text in and produces text out. On its own it has no memory between calls, no tools, and no ability to act in the world. The models behind ChatGPT, Claude, and Grok, in their basic chat form, are LLMs you talk to one turn at a time; the chat app keeps the thread going by resending the conversation with each turn.

Good at: drafting, summarizing, answering, translating, coding.

AI Agent

A system built around an LLM that can plan, call tools (search, email, your CRM, a database), remember what happened, and loop until a goal is met. An agent doesn’t just answer - it gets things done.

Good at: multi-step workflows, research, scheduling, triage, follow-ups.

Plainly: an LLM answers “what should the email say?” An agent reads the inbox, decides which message to reply to, drafts it, checks your calendar, and sends the reply - using an LLM at every step.
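The plan, call-a-tool, remember, repeat loop described above can be sketched in a few lines. This is a toy under stated assumptions, not how any particular agent framework works: `fake_llm` is a hypothetical stand-in for a real model call, and the three tools in `TOOLS` are made-up functions that return canned strings.

```python
from typing import Callable, Dict, List


def fake_llm(goal: str, memory: List[str]) -> Dict[str, str]:
    """Hypothetical stand-in for a model call: picks the next step from memory."""
    done = {entry.split(" -> ")[0] for entry in memory}
    if "read_inbox" not in done:
        return {"tool": "read_inbox", "arg": ""}
    if "check_calendar" not in done:
        return {"tool": "check_calendar", "arg": "Tuesday"}
    if "send_reply" not in done:
        return {"tool": "send_reply", "arg": "Tuesday at 10:00 works for me."}
    return {"tool": "done", "arg": ""}


# Made-up tools for illustration; a real agent would call email and calendar APIs.
TOOLS: Dict[str, Callable[[str], str]] = {
    "read_inbox": lambda arg: "1 message: 'Can we meet Tuesday?'",
    "check_calendar": lambda arg: f"{arg}: free at 10:00",
    "send_reply": lambda arg: f"sent: {arg}",
}


def run_agent(
    goal: str,
    llm: Callable[[str, List[str]], Dict[str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> List[str]:
    """Loop: ask the model for the next action, run the tool, remember the result."""
    memory: List[str] = []
    for _ in range(max_steps):  # hard cap so a confused model cannot loop forever
        action = llm(goal, memory)
        if action["tool"] == "done":
            break
        result = tools[action["tool"]](action["arg"])
        memory.append(f"{action['tool']} -> {result}")
    return memory


if __name__ == "__main__":
    for step in run_agent("Reply to the meeting request", fake_llm, TOOLS):
        print(step)
```

The step cap is the one design choice worth copying into real projects: it bounds both cost and damage when the model never decides it is done.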

03 - Modalities

Text, Images, Audio, Video

A “modality” is just a type of data. Modern frontier models are increasingly multimodal - they can read, see, hear, and generate across formats in one conversation.

Text

The most mature modality. Drafting, summarizing, translation, classification, extraction, code generation, and reasoning.

Images

Generate marketing visuals, product mockups, and illustrations - or feed photos in for inspection, OCR, and visual Q&A.

Audio

Real-time speech-to-text, natural-sounding voices, voice agents, meeting transcription, and music or sound generation.

Video

The newest frontier. Short-clip generation, video understanding, scene description, and editing assistants are improving rapidly.

04 - The Currency of AI

Tokens & Frontier Model Performance

What is a token?

A token is the chunk of text a model actually reads and writes. It’s usually a short word or a piece of one. As a rough rule of thumb for English text, 1 token ≈ 4 characters, so 1,000 tokens is about 750 words.

Tokens matter because they decide three things at once: how much you can fit into a single request, how fast the response feels, and how much it costs.
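For quick planning, the rule of thumb above can be turned into a tiny estimator. This is only a sketch: the function names are made up for illustration, and the 4-characters and 0.75-words ratios are the rough figures from the text, not the output of a real tokenizer, so expect the true count to differ by model and language.

```python
import math

CHARS_PER_TOKEN = 4      # rough figure from the text; assumes English prose
WORDS_PER_TOKEN = 0.75   # i.e. about 750 words per 1,000 tokens


def estimate_tokens_from_text(text: str) -> int:
    """Estimate the token count from character length (approximate)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate the token count from a word count (approximate)."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


if __name__ == "__main__":
    sample = "Tokens decide how much fits, how fast it feels, and what it costs."
    print(estimate_tokens_from_text(sample))
    print(estimate_tokens_from_words(750))
```

When the number matters for billing or hard limits, count with the tokenizer that belongs to the model you are actually calling.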

What to expect from frontier labs

“Frontier” models are the largest, most capable systems from labs like OpenAI, Anthropic, Google, xAI, and Meta. Performance varies by model and load, but as a working baseline:

Streaming speed

50–200 tok/s

How fast text appears as it’s generated.

Time to first token

0.3–2 s

Lag before the answer starts streaming.

Cost per million tokens

$0.10–$15

Cheap small models to top-tier reasoning.

Reasoning modes that “think” before answering trade speed for quality - expect longer waits and higher token usage in exchange for noticeably better answers on hard problems.
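The three figures above combine into a back-of-envelope estimate of how long a response takes and what it costs. The sketch below assumes input and output tokens are priced separately, which is common among providers but is not something the ranges above spell out; the function names and the example prices are illustrative, not quotes from any price list.

```python
def response_seconds(output_tokens: int, ttft_s: float, tokens_per_s: float) -> float:
    """Wall-clock time: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tokens_per_s


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one request, with prices quoted per million tokens."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Illustrative numbers only: a 2,000-token prompt and a 500-token answer.
    print(response_seconds(output_tokens=500, ttft_s=1.0, tokens_per_s=100.0))
    print(request_cost_usd(2_000, 500, 3.0, 15.0))
```

A reasoning mode fits the same arithmetic: its hidden "thinking" adds to the output tokens, which is why both the wait and the bill grow.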

05 - Where AI Runs

AI Servers vs Edge Devices

AI runs in two very different places: massive data-center servers packed with specialized GPUs, and the laptop, phone, or appliance sitting in front of you. Each has real strengths.

AI Servers (Cloud GPUs)

Racks of data-center accelerators - NVIDIA H100 / H200 / Blackwell, AMD MI300, Google TPUs - with tens to hundreds of gigabytes of ultra-fast memory each, networked together to run the largest frontier models.

Runs the biggest, smartest models
Scales to thousands of users at once
Pay-per-token, no hardware to own
Data leaves your building

Edge (NPUs & Consumer GPUs)

Modern laptops and phones ship with NPUs (Neural Processing Units) for low-power AI, and consumer GPUs like the NVIDIA RTX series, or the integrated GPUs in Apple Silicon, can run surprisingly capable models locally with no internet round-trip.

Data never leaves the device
Works offline, low latency
No per-token bill
Limited to small & mid-size models

Many real systems blend both: a cheap local model handles routine tasks instantly, and a frontier cloud model is called in for the hard ones.
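A blended setup like this usually comes down to a small router in front of two models. The sketch below is one possible shape, with loud assumptions: `local_model` and `cloud_model` are hypothetical stubs rather than real clients, and the "is this hard?" test is a crude word-count and keyword heuristic chosen only to keep the example short.

```python
from typing import Callable

Model = Callable[[str], str]

# Hypothetical keywords that hint at a harder request.
HARD_HINTS = ("analyze", "compare", "plan", "prove")


def local_model(prompt: str) -> str:
    """Stub for a small on-device model."""
    return f"[local] {prompt[:40]}"


def cloud_model(prompt: str) -> str:
    """Stub for a frontier model behind an API."""
    return f"[cloud] {prompt[:40]}"


def looks_hard(prompt: str, max_easy_words: int = 50) -> bool:
    """Crude heuristic: long prompts or certain keywords go to the cloud."""
    text = prompt.lower()
    return len(text.split()) > max_easy_words or any(h in text for h in HARD_HINTS)


def route(
    prompt: str,
    local: Model = local_model,
    cloud: Model = cloud_model,
    is_hard: Callable[[str], bool] = looks_hard,
) -> str:
    """Send routine prompts to the local model and hard ones to the cloud."""
    return cloud(prompt) if is_hard(prompt) else local(prompt)


if __name__ == "__main__":
    print(route("Fix the typo in this sentence."))
    print(route("Compare these two vendor contracts."))
```

In practice the routing test is the part to invest in; some teams let the small model itself decide when to escalate.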

06 - Sizing a Model

Parameter Size & Memory

A model’s “parameters” are the numbers it learned during training. More parameters generally means more knowledge and better reasoning - and a bigger memory footprint to run.

A simple way to estimate the RAM (or VRAM) you need: each parameter takes 2 bytes at standard 16-bit precision. So a 7-billion-parameter model needs roughly 14 GB of memory just to load its weights, plus extra headroom for the conversation itself.

Model size	RAM at 16-bit	RAM at 4-bit	Where it fits
1–3 B	~2–6 GB	~1–2 GB	Phones, NPUs, any modern laptop
7–8 B	~14–16 GB	~4–5 GB	Mainstream laptops, mid-range GPUs
13–14 B	~26–28 GB	~8–10 GB	Workstations, enthusiast GPUs
30–34 B	~60–70 GB	~18–22 GB	High-end workstations, single data-center GPU
70 B	~140 GB	~40–48 GB	Multi-GPU servers
400 B+	800 GB+	200 GB+	Frontier-lab clusters only

Approximate values. Real usage depends on architecture, batch size, and how much context you load.
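The sizing rule behind the table is plain arithmetic: parameters times bytes per parameter. A minimal sketch follows. The function name is made up, and `headroom_fraction` is a hypothetical knob for runtime overhead, not a figure from the text; the table's 4-bit column sits a little above the bare weights-only number, which suggests it already includes some allowance of that kind.

```python
def weights_memory_gb(
    params_billions: float,
    bits_per_param: int = 16,
    headroom_fraction: float = 0.0,
) -> float:
    """Approximate memory for the weights, in GB (1 GB = 1e9 bytes).

    headroom_fraction is an assumed allowance for context and runtime
    overhead, e.g. 0.2 adds 20 percent on top of the raw weights.
    """
    bytes_per_param = bits_per_param / 8
    # 1e9 parameters x bytes per parameter / 1e9 bytes per GB: the 1e9s cancel.
    raw_gb = params_billions * bytes_per_param
    return raw_gb * (1 + headroom_fraction)


if __name__ == "__main__":
    for size in (7, 13, 70):
        print(size, weights_memory_gb(size, 16), weights_memory_gb(size, 4))
```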

07 - Making Models Smaller

Quantization

Quantization compresses a model by storing each parameter with fewer bits - for example, dropping from 16 bits down to 8, 4, or even fewer. The model gets much smaller and usually faster, typically with only a small dip in quality at 8 or 4 bits.

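It’s the single biggest reason capable AI now runs on laptops and phones. A 70-billion-parameter model that needs a multi-GPU server at 16-bit shrinks to roughly 40–48 GB at 4-bit, small enough for a pair of 24 GB consumer GPUs or a high-memory Apple Silicon machine, and a 7–8 B model drops to 4–5 GB, which fits on a mainstream laptop.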

As a rough guide: 8-bit is nearly indistinguishable from the original, 4-bit is the sweet spot for local use, and anything below 4-bit starts to noticeably affect quality on harder tasks.

Bits per parameter

16-bit: full quality, full size
8-bit: ~50% smaller
4-bit: ~75% smaller
2–3 bit: aggressive, lossy
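To see why fewer bits means lower fidelity, here is a toy quantizer. It uses symmetric "absmax" rounding, which is just one simple scheme picked for illustration; real quantization tools typically work on small groups of weights and use more elaborate formats. The weights in the demo are made-up numbers, not taken from any model.

```python
from typing import List, Tuple


def quantize(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Map floats to signed integers of the given width, plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.61, -0.12, 0.0]
    for bits in (8, 4, 2):
        codes, scale = quantize(weights, bits)
        err = mean_abs_error(weights, dequantize(codes, scale))
        print(f"{bits}-bit codes {codes} mean abs error {err:.4f}")
```

Running it shows the rounding error growing as the bit width shrinks, which is the quality dip the guide above describes.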

08 - The Model’s Working Memory

Context Windows

The context window is everything the model can “see” at once: your instructions, the conversation so far, any documents you pasted in, and the answer it’s building. It’s measured in tokens.

Small

8K tokens

~6,000 words. A short report or a few emails.

Standard

128K tokens

~300 pages. A long report or a folder of source code.

Large

1M tokens

~2,500 pages. Entire books or a small codebase at once.

Frontier

2M+ tokens

A full library of reference material in one prompt.

Bigger is not always better. Models often pay less attention to material buried deep in a long context, and every token in the window is a token you’re paying for. Curate what you send.
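One common way to curate is to keep the instructions and key documents fixed and drop the oldest conversation turns until the request fits, leaving room for the answer. The sketch below assumes the 4-characters-per-token rule of thumb in place of a real tokenizer, and the function names and the `reserve_for_answer` parameter are hypothetical.

```python
import math
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (an approximation)."""
    return math.ceil(len(text) / 4)


def trim_history(
    instructions: str,
    documents: List[str],
    history: List[str],
    window_tokens: int,
    reserve_for_answer: int,
) -> List[str]:
    """Return the most recent turns that fit alongside instructions and documents."""
    budget = window_tokens - reserve_for_answer
    used = estimate_tokens(instructions) + sum(estimate_tokens(d) for d in documents)
    if used > budget:
        raise ValueError("instructions and documents alone exceed the window")

    kept: List[str] = []
    # Walk from the newest turn backwards, keeping turns while they still fit.
    for turn in reversed(history):
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    turns = ["first question " * 30, "second question " * 30, "latest question"]
    print(trim_history("Be concise.", ["policy text " * 50], turns, 400, 100))
```

Dropping the oldest turns first is only one policy; summarizing old turns or retrieving just the relevant passages are other options with the same goal.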

09 - Working With AI

Skills to Build to Use AI Effectively

The biggest gains from AI don’t come from buying a fancier model - they come from people who know how to drive it. These are the skills worth investing in.

Prompt Engineering

Writing clear, specific instructions: give the model a role, show examples, state the format you want, and tell it what to do when it’s unsure. The single highest-leverage AI skill.
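Those four ingredients (a role, examples, a format, and a fallback for uncertainty) fit naturally into a reusable template. The helper below is a hypothetical sketch of one layout; the wording of each line is an assumption, not a format any model requires.

```python
from typing import List, Tuple


def build_prompt(
    role: str,
    task: str,
    examples: List[Tuple[str, str]],
    output_format: str,
    fallback: str,
) -> str:
    """Assemble a prompt with a role, worked examples, a format, and a fallback."""
    lines = [f"You are {role}.", "", f"Task: {task}", ""]
    for i, (example_input, example_output) in enumerate(examples, start=1):
        lines.append(f"Example {i} input: {example_input}")
        lines.append(f"Example {i} output: {example_output}")
        lines.append("")
    lines.append(f"Format: {output_format}")
    lines.append(f"If you are unsure: {fallback}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        build_prompt(
            role="a support agent",
            task="Classify the ticket as bug, billing, or other.",
            examples=[("App crashes on login", "bug")],
            output_format="one lowercase word",
            fallback="answer 'other' and explain why in one sentence",
        )
    )
```

Keeping the template in code means every request gets the same structure, and a change to the instructions happens in one place.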

Context Curation

Knowing what to put in the prompt and what to leave out. Feeding the right document, the right examples, and the right constraints - not everything you have.

Verification & Critical Reading

Treat AI output as a confident first draft, not a final answer. Spot hallucinations, check sources, and never publish anything you haven’t read.

Model & Tool Selection

Knowing when to use a small fast model, when to spend on a reasoning model, and when an agent or workflow is a better tool than a chat window.

Workflow Design

Breaking real work into steps an AI can do well, with humans in the loop at the right moments. The skill that turns a clever demo into a real productivity gain.

Privacy & Data Hygiene

Knowing what is safe to paste into which tool, when to use a private model, and how to handle customer data responsibly.