Base64 is everywhere in modern software: embedded images in CSS, JWT tokens, the Authorization: Basic header, MIME email attachments, signed cookies, even the occasional
ETag. It is the de-facto answer to one specific problem — moving arbitrary binary bytes through
a channel that was designed for printable text — and it is wildly overused for problems it does
not solve.
This post explains exactly what Base64 does at the byte level, where the famous 33% overhead comes from, what changes in the URL-safe variant, and when reaching for Base64 is the wrong call. By the end you should be able to tell, at a glance, whether a given payload belongs in Base64 at all.
What Base64 Actually Does
Base64 is defined in RFC 4648, which obsoleted the older RFC 3548. The MIME definition in RFC 2045 is older still and remains the reference for email. The mechanism is simple:
- Take the input as a stream of 8-bit bytes.
- Regroup the bits into 6-bit chunks.
- Map each 6-bit value (0–63) to one character in the alphabet A–Z a–z 0–9 + /.
- If the input length is not a multiple of 3, pad the output to a multiple of 4 with = characters.
The arithmetic is the source of every property people care about. Three input bytes are 24 bits, which split cleanly into four 6-bit chunks, producing exactly four output characters. That 3-to-4 ratio is the fundamental constant.
A Worked Example
Encode the three ASCII bytes Man (0x4D 0x61 0x6E):
Bytes:    01001101 01100001 01101110
Regroup:  010011 010110 000101 101110
Decimal:  19     22     5      46
Alphabet: T      W      F      u
Output:   "TWFu"
Three bytes in, four characters out. No padding needed. If you encode only two of those bytes (Ma → 0x4D 0x61), the encoder zero-pads the bit stream up to a multiple of 6 bits, emits the three meaningful characters (TWE), and appends a single = so the output length is still a multiple of 4, giving TWE=. One byte of input (M) yields two characters of data plus two = pads: TQ==.
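The regrouping walked through above can be sketched in a few lines of JavaScript. This is a minimal illustration rather than a replacement for the built-in encoders; the name encodeBase64 is invented for this example, and the input is assumed to be a Uint8Array (or a Node.js Buffer).

```javascript
const ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Hypothetical helper for illustration: encodes a Uint8Array to standard Base64.
function encodeBase64(bytes) {
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const remaining = bytes.length - i; // bytes available for this group
    // Pack up to three bytes into one 24-bit group, zero-filling missing bytes.
    const group =
      (bytes[i] << 16) |
      ((remaining > 1 ? bytes[i + 1] : 0) << 8) |
      (remaining > 2 ? bytes[i + 2] : 0);
    // Slice the 24 bits into four 6-bit indexes; pad where no input byte existed.
    out += ALPHABET[(group >> 18) & 63];
    out += ALPHABET[(group >> 12) & 63];
    out += remaining > 1 ? ALPHABET[(group >> 6) & 63] : '=';
    out += remaining > 2 ? ALPHABET[group & 63] : '=';
  }
  return out;
}

const utf8 = new TextEncoder();
console.log(encodeBase64(utf8.encode('Man'))); // "TWFu"
console.log(encodeBase64(utf8.encode('Ma')));  // "TWE="
console.log(encodeBase64(utf8.encode('M')));   // "TQ=="
```

In real code, prefer the platform encoder; the sketch only makes the 3-bytes-to-4-characters step visible.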
Where the 33% Overhead Comes From
Every 3 input bytes become 4 output bytes (the output is ASCII, so 1 character = 1 byte on
the wire). The expansion factor is exactly 4 / 3 ≈ 1.3333, which is the
oft-quoted "33% overhead". Padding can add up to 2 more characters at the very end, so the
worst-case ratio for a tiny input is much higher:
- 1 byte input → 4 chars output (including == pad): 300% expansion
- 2 bytes input → 4 chars output (including = pad): 100% expansion
- 3 bytes input → 4 chars output, no padding: 33% expansion
- 1,000,000 bytes input → 1,333,336 chars output: 33.3% expansion
For anything larger than a few hundred bytes the overhead settles at 33%, which is the number you should plug into capacity estimates. A 5 MB image embedded in JSON becomes a ~6.67 MB string, plus the rest of the payload. The Base64 alphabet contains no quotes or backslashes, so the string itself needs no JSON escaping, although some serializers escape / as \/ and add a little more.
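For capacity estimates, the padded output length follows directly from the 3-to-4 ratio. The helper below is a small sketch; the name base64Length is made up for this example.

```javascript
// Hypothetical helper: padded Base64 output length for n input bytes.
function base64Length(n) {
  return 4 * Math.ceil(n / 3);
}

for (const n of [1, 2, 3, 1_000_000]) {
  const encoded = base64Length(n);
  const overheadPct = ((encoded / n - 1) * 100).toFixed(1);
  console.log(`${n} bytes -> ${encoded} chars (+${overheadPct}%)`);
}
// 1 bytes -> 4 chars (+300.0%)
// 2 bytes -> 4 chars (+100.0%)
// 3 bytes -> 4 chars (+33.3%)
// 1000000 bytes -> 1333336 chars (+33.3%)
```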
Padding Rules and Why They Sometimes Disappear
Padding exists so a Base64 stream can be concatenated and decoded unambiguously — the
decoder can tell from the = characters how many of the trailing bits were real. RFC
4648 §3.2 specifies the canonical encoding with padding, but it also explicitly allows omitting
padding in contexts where the length is known or the encoded string ends at a known boundary.
The two real-world places padding disappears:
- JWT: each of the three segments in a JWT is base64url-encoded without padding. The . separator and the end-of-token are the boundaries.
- Compact URLs and filenames: = often has meaning in those contexts (key/value separator in URL query strings, variable assignment in shells, etc.), so it is dropped and added back at decode time.
A robust decoder accepts both padded and unpadded input. Many standard-library decoders are
strict and require correct padding, so when you see "Invalid base64 string" errors on a JWT
segment, the first thing to try is appending = characters until the length is a multiple
of 4.
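Restoring the padding is a one-liner once the remainder is known. The sketch below uses a made-up helper name, restorePadding, and rejects the one remainder that can never come from valid Base64.

```javascript
// Hypothetical helper: re-add the "=" padding a strict decoder expects.
function restorePadding(b64) {
  const remainder = b64.length % 4;
  if (remainder === 1) {
    // No whole number of bytes encodes to 4k + 1 characters.
    throw new Error('Invalid Base64 length');
  }
  return remainder === 0 ? b64 : b64 + '='.repeat(4 - remainder);
}

console.log(restorePadding('TWE'));  // "TWE="
console.log(restorePadding('TQ'));   // "TQ=="
console.log(restorePadding('TWFu')); // "TWFu"
```

For a JWT segment, the alphabet also needs converting from base64url before a standard decoder will accept it.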
Standard Base64 vs base64url
Standard Base64 uses + and / as the last two alphabet characters. Both
are problematic outside of a strictly text context:
- + has a special meaning in application/x-www-form-urlencoded (it decodes to a space).
- / is a path separator on every Unix-like filesystem and inside URL paths.
The URL-safe variant defined in RFC 4648 §5 (base64url) swaps those two characters:
- + becomes -
- / becomes _
Everything else is identical, including the alphabet ordering and the 6-bit-to-character mapping. Crucially, standard Base64 and base64url are not interchangeable. A strict standard Base64 decoder rejects - and _ as invalid input, and a strict base64url decoder rejects + and /, so a token that contains any of those characters fails in the other decoder. Strings that happen to avoid all four characters decode identically in both, which is why this bug often survives testing.
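Converting between the two alphabets is a plain character swap plus padding handling. The helpers below are a sketch with invented names (toBase64Url, fromBase64Url), and they assume the input is already valid in its own alphabet.

```javascript
// Hypothetical helper: standard Base64 -> unpadded base64url.
function toBase64Url(b64) {
  return b64.replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

// Hypothetical helper: base64url (padded or not) -> padded standard Base64.
function fromBase64Url(b64url) {
  const b64 = b64url.replace(/-/g, '+').replace(/_/g, '/').replace(/=+$/, '');
  return b64 + '='.repeat((4 - (b64.length % 4)) % 4);
}

// These three bytes encode to nothing but the two "problem" characters.
const standard = Buffer.from([0xfb, 0xff, 0xfe]).toString('base64');
console.log(standard);                             // "+//+"
console.log(toBase64Url(standard));                // "-__-"
console.log(fromBase64Url(toBase64Url(standard))); // "+//+"
```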
Real Use Cases
data: URIs
Inline a small image in CSS or HTML:
background: url('data:image/png;base64,iVBORw0KGgoAAA...');
Saves an HTTP request. Cost: the asset is now 33% larger in your HTML/CSS, will be re-downloaded on every page that includes it, and cannot be cached separately by the browser. Useful for very small images (under ~1 KB) where the per-request overhead beats the inflation cost.
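Building such a URI from raw bytes takes one line in Node.js. The sketch below uses an invented helper name, toDataUri, and feeds it only the 8-byte PNG file signature as a stand-in for a real image, which also shows why PNG data URIs start with the same few characters.

```javascript
// Hypothetical helper: wrap raw bytes in a data: URI.
function toDataUri(bytes, mimeType) {
  return `data:${mimeType};base64,${Buffer.from(bytes).toString('base64')}`;
}

// The 8-byte PNG signature stands in for a full image file here.
const pngSignature = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
console.log(toDataUri(pngSignature, 'image/png'));
// "data:image/png;base64,iVBORw0KGgo="
```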
MIME Email Attachments
SMTP was originally designed for 7-bit ASCII. RFC 2045 specifies Base64 as the standard Content-Transfer-Encoding for binary attachments, with encoded lines limited to 76 characters. Every PDF, image, or Word document you have ever emailed travelled the wire in Base64 and paid the 33% it adds, plus a CRLF after every line, which brings the total to roughly 37%.
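The line wrapping can be sketched as follows. The helper name toMimeBase64 is made up, and the example covers only the body encoding, not the surrounding MIME headers or boundaries.

```javascript
// Hypothetical helper: Base64-encode bytes and wrap at 76 characters per line,
// joined with CRLF, in the style of a MIME body part.
function toMimeBase64(bytes) {
  const b64 = Buffer.from(bytes).toString('base64');
  const lines = b64.match(/.{1,76}/g) ?? [];
  return lines.join('\r\n');
}

// 114 input bytes -> 152 Base64 characters -> two full 76-character lines.
const body = toMimeBase64(Buffer.alloc(114));
console.log(body.split('\r\n').map((line) => line.length)); // [ 76, 76 ]
```

Each full line carries 57 input bytes, since 57 × 4 / 3 = 76.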
HTTP Basic Authentication
The Authorization: Basic header carries base64(username + ":" + password). Note: Base64 here is just a transport
encoding, not a security measure — the credentials are recoverable in one decode step.
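A short sketch makes the point about recoverability concrete. The function names and the credentials are placeholders invented for this example.

```javascript
// Hypothetical helper: build the Authorization header value.
function basicAuthHeader(username, password) {
  const token = Buffer.from(`${username}:${password}`, 'utf8').toString('base64');
  return `Basic ${token}`;
}

// Hypothetical helper: recover the credentials in a single decode step.
function decodeBasicAuth(headerValue) {
  const token = headerValue.replace(/^Basic\s+/i, '');
  const decoded = Buffer.from(token, 'base64').toString('utf8');
  const separator = decoded.indexOf(':'); // the password may itself contain ":"
  return {
    username: decoded.slice(0, separator),
    password: decoded.slice(separator + 1),
  };
}

const header = basicAuthHeader('aladdin', 'open:sesame');
console.log(decodeBasicAuth(header));
// { username: 'aladdin', password: 'open:sesame' }
```

Anyone who can read the header can run the second function, which is why Basic authentication is only acceptable over TLS.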
JWT Segments
Each of the three JWT segments (header, payload, signature) is base64url-encoded without padding, then joined with a dot (.). The header and payload are JSON; the signature is raw bytes from HMAC or an asymmetric algorithm. See the UUID v4 vs v7 article for related territory on encoded identifiers and how a few well-chosen bits can replace a lot of ad-hoc encoding.
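Reading the header and payload of a token therefore needs nothing more than a base64url decode and a JSON parse. The sketch below assumes a Node.js version whose Buffer supports the 'base64url' encoding; the helper name decodeJwtSegment and the example claims are invented, and the signature is a dummy value rather than a real HMAC.

```javascript
// Hypothetical helper: decode one base64url segment into a parsed JSON value.
// Buffer's 'base64url' encoding accepts unpadded input.
function decodeJwtSegment(segment) {
  return JSON.parse(Buffer.from(segment, 'base64url').toString('utf8'));
}

// Build an example token locally; the signature is a dummy, not a real HMAC.
const encodeSegment = (obj) => Buffer.from(JSON.stringify(obj)).toString('base64url');
const token = [
  encodeSegment({ alg: 'HS256', typ: 'JWT' }),
  encodeSegment({ sub: '1234', admin: false }),
  Buffer.from('not-a-real-signature').toString('base64url'),
].join('.');

const [headerPart, payloadPart] = token.split('.');
console.log(decodeJwtSegment(headerPart));  // { alg: 'HS256', typ: 'JWT' }
console.log(decodeJwtSegment(payloadPart)); // { sub: '1234', admin: false }
```

Decoding is not verification: this reads the claims but says nothing about whether the signature is valid.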
ETags and Cache Tags
Some APIs Base64-encode binary hash digests in their ETag response headers, because the header value must be quoted ASCII. This is standard practice, though hex encoding is more common and more human-readable, at the cost of output 50% longer than Base64 (2 characters per byte instead of roughly 1.33).
When NOT to Use Base64
The 33% tax is paid every single time the payload is transmitted, stored, parsed, or logged. Three categories where you should reach for a different mechanism:
Large or Streaming Binary Transfer
Uploading a 50 MB video as a Base64 string inside JSON costs you ~67 MB on the wire, plus the JSON parser has to materialise a 67 MB string in memory before your code can decode it back. Use multipart/form-data or a direct binary PUT with Content-Type: application/octet-stream. Both transmit the raw bytes and can stream on both ends.
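The size difference is easy to measure. The sketch below compares the two request bodies for a 3 MiB placeholder buffer; the JSON field name "file" is made up for the example.

```javascript
const raw = Buffer.alloc(3 * 1024 * 1024); // 3 MiB of placeholder binary data

// Option A: Base64 inside a JSON body (the field name "file" is hypothetical).
const jsonBody = JSON.stringify({ file: raw.toString('base64') });

// Option B: the raw bytes, sent as-is with Content-Type: application/octet-stream.
const binaryBody = raw;

console.log(binaryBody.length); // 3145728
console.log(jsonBody.length);   // 4194315
```

Option A also has to exist as one complete string before parsing can start, whereas the raw body can be read and written in chunks.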
Database BLOB Storage
Storing Base64 in a TEXT column costs 33% more storage than the raw bytes and forces every read to decode. Use a BLOB/BYTEA column for binary data, or an external object store with the row holding only a reference URL. The only legitimate exception is when you genuinely need the column to be human-readable in CLI tools that lack hex-dump support.
High-Volume APIs
On a high-traffic API, the 33% overhead compounds across millions of requests. If the data is text already, send it as text. If it is binary and the API still demands JSON, consider compressing it before encoding (gzip or brotli), and Base64-encode the compressed bytes only when JSON inclusion is genuinely required.
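The compress-then-encode order can be sketched with Node's built-in zlib module. The helper names packForJson and unpackFromJson are invented, and the sample data is deliberately repetitive so that it compresses well; real payloads will see smaller gains, and already-compressed formats such as JPEG may see none.

```javascript
const zlib = require('zlib');

// Hypothetical helper: compress first, then Base64 the compressed bytes.
function packForJson(bytes) {
  return zlib.gzipSync(bytes).toString('base64');
}

// Hypothetical helper: reverse the two steps in the opposite order.
function unpackFromJson(b64) {
  return zlib.gunzipSync(Buffer.from(b64, 'base64'));
}

// Highly repetitive sample data, chosen because it compresses well.
const sample = Buffer.from('0123456789'.repeat(10_000));
const plain = sample.toString('base64');
const packed = packForJson(sample);

console.log(plain.length, packed.length);           // packed is far shorter here
console.log(unpackFromJson(packed).equals(sample)); // true
```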
Common Bugs
Forgetting Padding
Strict decoders reject input whose length is not a multiple of 4. If you stripped = for URL safety and forgot to add it back, you will see "Invalid base64
string" or similar. Fix: pad with = until length % 4 == 0.
Mixing Standard and URL-Safe Alphabets
A common bug: encode with the standard alphabet (producing + and /), place the string in a URL query parameter, forget to percent-encode, and
receive an error on decode because + has been converted to a space by the
form-decoder. Symmetric problem on the other side: decoding base64url input with a standard
decoder rejects - and _.
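The first half of that bug can be reproduced with URLSearchParams, which applies form decoding to query values. The parameter name "token" is a placeholder.

```javascript
// Four bytes chosen so the encoded form contains both "+" and "/".
const token = Buffer.from([0xfb, 0xff, 0xfe, 0xfb]).toString('base64');

// Placed in a query string unescaped, every "+" is read back as a space.
const broken = new URLSearchParams(`token=${token}`).get('token');
console.log(broken === token);     // false
console.log(broken.includes(' ')); // true

// Percent-encoding first preserves the original characters.
const safe = new URLSearchParams(`token=${encodeURIComponent(token)}`).get('token');
console.log(safe === token);       // true
```

Using base64url for anything that travels in a URL avoids the problem entirely.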
Double Base64
Encoding an already-encoded string adds another 33% on top of the first 33%, for roughly 78% in total. Most often this happens when a framework auto-Base64s a request body that already contains Base64 payloads. The symptom is a string that decodes to more Base64-looking text, with characters only from the Base64 alphabet. Walk the layer chain to find the duplicate encoder.
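Both the size growth and the symptom are easy to reproduce. The sketch below uses 300 placeholder bytes; the regular expression is a rough heuristic for "looks like Base64", not a validator.

```javascript
const raw = Buffer.alloc(300, 0xab); // 300 bytes of placeholder data

const once = raw.toString('base64');
const twice = Buffer.from(once, 'utf8').toString('base64');

console.log(raw.length, once.length, twice.length); // 300 400 536

// Decoding one layer of a double-encoded value yields more Base64 text.
const peeled = Buffer.from(twice, 'base64').toString('utf8');
console.log(peeled === once);                        // true
console.log(/^[A-Za-z0-9+\/]+={0,2}$/.test(peeled)); // true
```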
Case Sensitivity
The Base64 alphabet contains both A and a, which map to different
6-bit values (0 and 26). Lower-casing or upper-casing a Base64 string corrupts the data
silently. This bites in any system that historically treated text as case-insensitive —
email subject lines, DNS, certain database collations, Windows filesystem paths.
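The corruption is silent because the lower-cased string is still valid Base64. A small demonstration, reusing the "Man" example:

```javascript
const original = 'TWFu';                // Base64 for "Man"
const lowered = original.toLowerCase(); // "twfu"

console.log(Buffer.from(original, 'base64').toString('latin1')); // "Man"

// "twfu" is still well-formed Base64, so decoding raises no error,
// but it yields three completely different bytes.
console.log(Buffer.from(lowered, 'base64')); // <Buffer b7 07 ee>
```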
Comparison With Other Binary-to-Text Encodings
- Hexadecimal: 100% overhead (1 byte → 2 hex chars). Universal, case-insensitive in practice, trivial to read. Used for hashes, MAC addresses, terminal dumps.
- Base32 (RFC 4648 §6): ~60% overhead. Restricted alphabet, no punctuation, case-insensitive. Used in TOTP shared secrets, onion addresses, and anywhere humans must transcribe the string.
- Base64: 33% overhead. The default for almost everything.
- Base85 / Ascii85: 25% overhead. Denser, but the alphabet includes quote and backslash, so it must be carefully escaped in JSON, SQL, and shell. Common inside PDF and Adobe formats; rare elsewhere.
The trade-off is simple: smaller alphabets are safer to embed but produce more characters. Base64 is the sweet spot for "most channels accept printable ASCII", which is why it won.
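Node's Buffer supports hex and Base64 directly, so the two most common choices can be compared on the same input. The 30 placeholder bytes below are arbitrary; Base32 and Base85 are omitted because Buffer has no built-in encoding for them.

```javascript
const digest = Buffer.alloc(30, 0x5a); // 30 placeholder bytes

console.log(digest.toString('hex').length);    // 60 (2 chars per byte)
console.log(digest.toString('base64').length); // 40 (4 chars per 3 bytes)
```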
JavaScript Quick Reference
The built-in btoa and atob functions are the obvious tools, with one big caveat: they treat every character of the string as a single byte, so they only accept the Latin-1 character range. Anything outside U+0000 to U+00FF throws InvalidCharacterError:
btoa('hello'); // "aGVsbG8=", works
btoa('café'); // "Y2Fm6Q==", works, but é (precomposed U+00E9) becomes the Latin-1 byte 0xE9, not UTF-8
btoa('日本語'); // InvalidCharacterError, always
For anything non-ASCII, encode to UTF-8 bytes first, then Base64 the bytes:
function toBase64(str) {
  // Encode the string as UTF-8, then map each byte to one "binary string" character.
  const bytes = new TextEncoder().encode(str);
  let binary = '';
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary);
}
function fromBase64(b64) {
  // Reverse the steps: binary string back to bytes, then decode the bytes as UTF-8.
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return new TextDecoder().decode(bytes);
}
toBase64('日本語'); // "5pel5pys6Kqe"
fromBase64('5pel5pys6Kqe'); // "日本語"
In Node.js, Buffer.from(str, 'utf8').toString('base64') is the one-liner. On the decode side, use Buffer.from(b64, 'base64').toString('utf8'). Both handle Unicode correctly out of the box because Buffer is byte-oriented.
A Practical Sanity Check
When in doubt, the Base64 Encoder on this site round-trips text and binary through both the standard and URL-safe variants, shows you the exact output length, and surfaces invalid input gracefully. If you are debugging a JWT, a data: URI, or a misbehaving API payload, paste it in — the encoder is the fastest path to confirm whether the issue is in the encoding or somewhere else entirely.
For broader fundamentals on debugging encoded data, the top JSON errors writeup covers neighbouring ground: malformed escapes, control characters, and the small pile of edge cases that surface only when binary content meets a text format.
