Why does Base64 add 33% to file size?

Base64 maps every 3 bytes of binary input (24 bits) into 4 printable ASCII characters (each carrying only 6 bits of information). The ratio 4/3 = 1.333 means the encoded output is always 33.3% larger than the source, before counting padding. Padding with "=" can add up to 2 more characters at the end, so a 1-byte input encodes to 4 characters — a 300% expansion in the smallest case.

What is the difference between Base64 and base64url?

Standard Base64 (RFC 4648 §4) uses "+" and "/" as the last two alphabet characters, both of which have meaning inside URLs and filenames. base64url (RFC 4648 §5) replaces "+" with "-" and "/" with "_", and typically omits the "=" padding, making the output safe to drop into a URL path, query string, or filename without further percent-encoding. JWT segments are encoded with base64url.

Can I Base64-encode Unicode text in JavaScript?

Not directly with btoa(). btoa() only accepts characters in the Latin-1 range (U+0000 to U+00FF) and throws InvalidCharacterError on anything else, including emoji and most non-Latin scripts. The correct approach is to encode the string to a UTF-8 byte sequence first using TextEncoder, then convert the resulting Uint8Array to Base64. The reverse pairs TextDecoder with a base64-to-bytes step.

When should I avoid Base64 in APIs?

Avoid Base64 for large or high-throughput binary payloads. The 33% size overhead translates directly into more bandwidth, more storage, and more JSON parser work on every request. Prefer multipart/form-data uploads for files, raw binary bodies for blob transfer, and database BLOB or file-storage references over inline Base64 in JSON. Base64 is for cases where the transport layer is genuinely restricted to text — email bodies, data: URIs, JWT segments, or hand-edited config files.

Is Base64 encryption?

No. Base64 is an encoding, not encryption — it provides no confidentiality whatsoever. The transformation is deterministic, the alphabet is public, and any decoder reverses it instantly. Storing a password or API token "in Base64" is equivalent to storing it in plaintext, just slightly harder to read at a glance. Use real cryptography (AES-GCM, libsodium, or your platform’s secret manager) for any actual secret.

Base64 33% Size Overhead Explained: RFC 4648 Formula & When It Matters

Base64 is everywhere in modern software: embedded images in CSS, JWT tokens, the Authorization: Basic header, MIME email attachments, signed cookies, even the occasional ETag. It is the de-facto answer to one specific problem — moving arbitrary binary bytes through a channel that was designed for printable text — and it is wildly overused for problems it does not solve.

This post explains exactly what Base64 does at the byte level, where the famous 33% overhead comes from, what changes in the URL-safe variant, and when reaching for Base64 is the wrong call. By the end you should be able to tell, at a glance, whether a given payload belongs in Base64 at all.

What Base64 Actually Does

Base64 is defined in RFC 4648 (which obsoleted the older RFC 3548 and the even older MIME definition in RFC 2045). The mechanism is mechanical:

Take the input as a stream of 8-bit bytes.
Regroup the bits into 6-bit chunks.
Map each 6-bit value (0–63) to one character in the alphabet A–Z a–z 0–9 + /.
If the input length is not a multiple of 3, pad the output to a multiple of 4 with = characters.

The arithmetic is the source of every property people care about. Three input bytes are 24 bits, which split cleanly into four 6-bit chunks, producing exactly four output characters. That 3-to-4 ratio is the fundamental constant.

A Worked Example

Encode the three ASCII bytes Man (0x4D 0x61 0x6E):

Bytes:    01001101  01100001  01101110
Regroup:  010011  010110  000101  101110
Decimal:    19      22       5      46
Alphabet:   T       W       F       u
Output: "TWFu"

Three bytes in, four characters out. No padding needed. If you encode only two of those bytes (Ma → 0x4D 0x61), the encoder zero-pads the bit stream up to a multiple of 6 bits, emits the three meaningful characters, and appends a single = so the output length is still a multiple of 4. One byte of input yields two characters of data plus two = pads.

Where the 33% Overhead Comes From

Every 3 input bytes become 4 output bytes (the output is ASCII, so 1 character = 1 byte on the wire). The expansion factor is exactly 4 / 3 ≈ 1.3333, which is the oft-quoted "33% overhead". Padding can add up to 2 more characters at the very end, so the worst-case ratio for a tiny input is much higher:

1 byte input → 4 chars output (including == pad) — 300% expansion
2 bytes input → 4 chars output (including = pad) — 100% expansion
3 bytes input → 4 chars output, no padding — 33% expansion
1,000,000 bytes input → 1,333,336 chars output — 33.3% expansion

For anything larger than a few hundred bytes the constant settles at 33%, which is the number you should plug into capacity estimates. A 5 MB image embedded in JSON becomes a ~6.67 MB string, plus JSON escaping for any quote or backslash, plus the rest of the payload.

Padding Rules and Why They Sometimes Disappear

Padding exists so a Base64 stream can be concatenated and decoded unambiguously — the decoder can tell from the = characters how many of the trailing bits were real. RFC 4648 §3.2 specifies the canonical encoding with padding, but it also explicitly allows omitting padding in contexts where the length is known or the encoded string ends at a known boundary.

The two real-world places padding disappears:

JWT: each of the three segments in a JWT is base64url-encoded without padding. The . separator and the end-of-token are the boundaries.
Compact URLs and filenames: = often has meaning in those contexts (URL query parameter delimiter, shell glob, etc.), so it is dropped and added back at decode time.

A robust decoder accepts both padded and unpadded input. Many standard-library decoders are strict and require correct padding, so when you see "Invalid base64 string" errors on a JWT segment, the first thing to try is appending = characters until the length is a multiple of 4.

Standard Base64 vs base64url

Standard Base64 uses + and / as the last two alphabet characters. Both are problematic outside of a strictly text context:

+ has a special meaning in application/x-www-form-urlencoded (it decodes to a space).
/ is a path separator on every Unix-like filesystem and inside URL paths.

The URL-safe variant defined in RFC 4648 §5 (base64url) swaps those two characters:

+ becomes -
/ becomes _

Everything else is identical, including the alphabet ordering and the 6-bit-to-character mapping. Crucially, standard Base64 and base64url are not interchangeable. A token encoded with base64url cannot be decoded by a strict standard Base64 decoder, and vice versa, because each will reject the other's two unique characters as invalid input.

Real Use Cases

data: URIs

Inline a small image in CSS or HTML:

background: url('data:image/png;base64,iVBORw0KGgoAAA...');

Saves an HTTP request. Cost: the asset is now 33% larger in your HTML/CSS, will be re-downloaded on every page that includes it, and cannot be cached separately by the browser. Useful for very small images (under ~1 KB) where the per-request overhead beats the inflation cost.

MIME Email Attachments

SMTP was originally designed for 7-bit ASCII. RFC 2045 specifies Base64 as the standard Content-Transfer-Encoding for binary attachments, with a hard line length of 76 characters. Every PDF, image, or Word document you have ever emailed travelled the wire in Base64, plus the 33% it adds.

HTTP Basic Authentication

The Authorization: Basic header carries base64(username + ":" + password). Note: Base64 here is just a transport encoding, not a security measure — the credentials are recoverable in one decode step.

JWT Segments

Each of the three JWT segments — header, payload, signature — is base64url-encoded without padding, then joined with .. The header and payload are JSON; the signature is raw bytes from HMAC or an asymmetric algorithm. See the UUID v4 vs v7 article for related territory on encoded identifiers and how a few well-chosen bits can replace a lot of ad-hoc encoding.

ETags and Cache Tags

Some APIs Base64-encode binary hash digests in their ETag response headers, because the header value must be quoted ASCII. Standard practice — though hex encoding is more common and human-readable, at the cost of an extra 50% overhead.

When NOT to Use Base64

The 33% tax is paid every single time the payload is transmitted, stored, parsed, or logged. Three categories where you should reach for a different mechanism:

Large or Streaming Binary Transfer

Uploading a 50 MB video as a Base64 string inside JSON costs you ~66 MB on the wire, plus the JSON parser has to materialise a 66 MB string in memory before your code can decode it back. Use multipart/form-data or a direct binary PUT with Content-Type: application/octet-stream. Both transmit the raw bytes and stream on both ends.

Database BLOB Storage

Storing Base64 in a TEXT column wastes 33% of the storage and forces every read to decode. Use a BLOB/BYTEA column for binary data, or an external object store with the row holding only a reference URL. The only legitimate exception is when you genuinely need the column to be human-readable in CLI tools that lack hex-dump support.

High-Volume APIs

On a high-traffic API, the 33% overhead compounds across millions of requests. If the data is text already, send it as text. If it is binary and the API still demands JSON, consider compressing it first (gzip or brotli at the transport layer), and only Base64-encoding the compressed bytes when JSON inclusion is genuinely required.

Common Bugs

Forgetting Padding

Strict decoders reject input whose length is not a multiple of 4. If you stripped = for URL safety and forgot to add it back, you will see "Invalid base64 string" or similar. Fix: pad with = until length % 4 == 0.

Mixing Standard and URL-Safe Alphabets

A common bug: encode with the standard alphabet (producing + and /), place the string in a URL query parameter, forget to percent-encode, and receive an error on decode because + has been converted to a space by the form-decoder. Symmetric problem on the other side: decoding base64url input with a standard decoder rejects - and _.

Double Base64

Encoding an already-encoded string adds another 33% on top of the first 33%. Most often this happens when a framework auto-Base64s a request body that already contains Base64 payloads. The symptom is a string that decodes to more Base64-looking text, with characters only from the Base64 alphabet. Walk the layer chain to find the duplicate encoder.

Case Sensitivity

The Base64 alphabet contains both A and a, which map to different 6-bit values (0 and 26). Lower-casing or upper-casing a Base64 string corrupts the data silently. This bites in any system that historically treated text as case-insensitive — email subject lines, DNS, certain database collations, Windows filesystem paths.

Comparison With Other Binary-to-Text Encodings

Horizontal bar chart comparing size overhead of four binary-to-text encodings: Base85 adds 25%, Base64 adds 33%, Hexadecimal adds 50%, and Base32 adds 60%. Base64 offers the best balance of density and universal compatibility. — Size overhead of binary-to-text encodings — Base64's 33% is the sweet spot between density and universal support. Source: RFC 4648.

Hexadecimal: 50% overhead (1 byte → 2 hex chars). Universal, case-insensitive in practice, trivial to read. Used for hashes, MAC addresses, terminal dumps.
Base32 (RFC 4648 §6): ~60% overhead. Restricted alphabet, no punctuation, case-insensitive. Used in TOTP shared secrets, onion addresses, and anywhere humans must transcribe the string.
Base64: 33% overhead. The default for almost everything.
Base85 / Ascii85: 25% overhead. Denser, but the alphabet includes quote and backslash, so it must be carefully escaped in JSON, SQL, and shell. Common inside PDF and Adobe formats; rare elsewhere.

The trade-off is simple: smaller alphabets are safer to embed but produce more characters. Base64 is the sweet spot for "most channels accept printable ASCII", which is why it won.

JavaScript Quick Reference

The browser-built btoa and atob functions are the obvious tools, with one big caveat: they only accept the Latin-1 character range. Anything outside U+0000 to U+00FF throws InvalidCharacterError:

btoa('hello');         // "aGVsbG8=" — works
btoa('café');          // works in some browsers, fails in others
btoa('日本語');         // InvalidCharacterError — always

For anything non-ASCII, encode to UTF-8 bytes first, then Base64 the bytes:

function toBase64(str) {
    const bytes = new TextEncoder().encode(str);
    let binary = '';
    for (const byte of bytes) binary += String.fromCharCode(byte);
    return btoa(binary);
}

function fromBase64(b64) {
    const binary = atob(b64);
    const bytes = new Uint8Array(binary.length);
    for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
    return new TextDecoder().decode(bytes);
}

toBase64('日本語');       // "5pel5pys6Kqe"
fromBase64('5pel5pys6Kqe'); // "日本語"

In Node.js, Buffer.from(str, 'utf8').toString('base64') is the one-liner. On the decode side, Buffer.from(b64, 'base64').toString('utf8'). Both handle Unicode correctly out of the box because Buffer is byte-oriented.

A Practical Sanity Check

When in doubt, the Base64 Encoder on this site round-trips text and binary through both the standard and URL-safe variants, shows you the exact output length, and surfaces invalid input gracefully. If you are debugging a JWT, a data: URI, or a misbehaving API payload, paste it in — the encoder is the fastest path to confirm whether the issue is in the encoding or somewhere else entirely.

For broader fundamentals on debugging encoded data, the top JSON errors writeup covers neighbouring ground: malformed escapes, control characters, and the small pile of edge cases that surface only when binary content meets a text format. Once you have decoded a Base64 payload back to JSON, you can paste it directly into the JSON Formatter to validate structure and spot malformed keys or missing brackets before passing the data further downstream.