The same text string, say "café 🎉", reports six characters in Python 3, seven in JavaScript, six in a grapheme-aware tool, and ten bytes in UTF-8. Before concluding that one of these tools is wrong, it helps to understand what each one is actually measuring. The disagreement is not a bug; it reflects four genuinely different definitions of "character" that coexist in modern computing.
The confusion matters in practice. An SMS that appears to fit in 160 characters on your phone breaks into a two-part message on a carrier that counts bytes differently. A tweet that looks like 278 characters in your editor gets flagged as over-limit by the Twitter API. A database VARCHAR(255) silently truncates a string that contains emoji. Each failure traces back to a mismatch between what you thought you were counting and what the system is actually counting.
What Is a "Character" Anyway?
Bytes: The Raw Storage Unit
Bytes are the most concrete unit. A byte is eight bits. In UTF-8 encoding, ASCII characters (A-Z, 0-9, punctuation) are one byte each, characters in Latin Extended and most European scripts are two bytes, CJK characters are three bytes, and emoji and supplementary plane characters are four bytes. When a server returns a Content-Length header or a database column stores VARCHAR data in a binary collation, bytes are what it counts.
// Node.js
Buffer.byteLength("café", "utf8") // 5 - 'é' is 2 bytes
Buffer.byteLength("A", "utf8") // 1
Buffer.byteLength("🎉", "utf8") // 4 - emoji is 4 bytes

Code Points: Unicode's Universal Identifiers
Unicode assigns every character (every letter, emoji, mathematical symbol, and control character) a unique number called a code point. Code points are written as U+XXXX (e.g., U+00E9 for "é", U+1F389 for "🎉"). As of Unicode 15.1 there are 149,813 assigned code points across 17 planes.
Python 3's len() counts code points. This is the most linguistically intuitive unit
for most text processing tasks, because it matches the "number of Unicode characters" that most
people mean when they say "character count".
# Python 3
len("café") # 4 - 'é' is U+00E9, one code point
len("🎉") # 1 - emoji is U+1F389, one code point
len("café 🎉") # 6

Code Units: The UTF-16 Complication
JavaScript, Java, and C# internally represent strings as UTF-16. In UTF-16, characters in the Basic Multilingual Plane (U+0000 to U+FFFF) take one 16-bit code unit. Characters above U+FFFF, including most emoji, require two code units called a surrogate pair.
JavaScript's .length property counts UTF-16 code units, not code points. This is
why emoji have a length of 2 in JavaScript even though they are a single code point.
// JavaScript
"café".length // 4
"🎉".length // 2 - surrogate pair
"café 🎉".length // 7, not 6
// To count code points in JavaScript:
[..."café 🎉"].length // 6

Grapheme Clusters: What Humans Actually See
A grapheme cluster is what a human perceives as a single character on screen. A family emoji like 👨‍👩‍👧 is one visible character but consists of three emoji joined by two Zero-Width Joiner characters, the sequence U+1F468 U+200D U+1F469 U+200D U+1F467, for a total of five code points and eight UTF-16 code units. Grapheme cluster counting is what word processors and grapheme-aware string libraries use.
// Using Intl.Segmenter (modern JS)
const segmenter = new Intl.Segmenter();
const segments = [...segmenter.segment("👨‍👩‍👧")];
segments.length // 1 - one grapheme cluster
"👨‍👩‍👧".length // 8 - UTF-16 code units
[..."👨‍👩‍👧"].length // 5 - code points

Platform-by-Platform Breakdown
Twitter/X: 280 Code Points, With Special Rules
Twitter's character limit is based on Unicode code points, but with special weights: every URL, regardless of its actual length, counts as exactly 23 characters, and emoji and most CJK characters count as 2 each. So "Check out https://example.com/very-long-path" consumes 10 + 23 = 33 characters by Twitter's count, not the actual 44 visible characters. This means you can paste a long URL into a tweet and have it count the same as a short one. Third-party scheduling tools that count naively will give you a different number than what the API ultimately enforces.
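As a rough sketch, this counting scheme can be approximated by charging a flat 23 for each URL and one per code point for everything else. This deliberately ignores the official twitter-text library's extra weighting for emoji and CJK text, and the URL pattern below is naive by design:

```python
import re

# Naive URL matcher, for illustration only; the real twitter-text
# library uses a far more elaborate pattern.
URL_RE = re.compile(r"https?://\S+")
URL_WEIGHT = 23  # every URL counts as exactly 23 characters

def tweet_length(text: str) -> int:
    """Approximate Twitter-style count: URLs weigh 23, the rest per code point."""
    stripped, n_urls = URL_RE.subn("", text)
    return len(stripped) + n_urls * URL_WEIGHT

tweet_length("Check out https://example.com/very-long-path")  # 33 = 10 + 23
```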
SMS: 160 Characters in GSM-7, 70 in Unicode
SMS messages use GSM-7 encoding by default, a 7-bit encoding that covers 128 characters (the basic Latin alphabet plus some punctuation, accented letters, and special chars). In GSM-7, a single SMS can hold 160 characters. The moment you include one character outside the GSM-7 alphabet (a smart quote, an accented character not in the set, or any emoji) the entire message switches to UCS-2 encoding and the single-SMS limit drops to 70 characters.
GSM-7 single SMS: 160 characters
GSM-7 multi-part: 153 chars/part (7 chars overhead per segment)
Unicode single SMS: 70 characters
Unicode multi-part: 67 chars/part

The "é" in "café" is in the basic GSM-7 alphabet, but "á", a curly apostrophe, or any emoji is not. This is why a single out-of-alphabet character can suddenly halve your SMS capacity. A marketing team that drafts a campaign in a word processor and casually uses a curly apostrophe instead of a straight one will send two-part messages and pay double the carrier cost, for every recipient.
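The branch between the two regimes can be sketched in Python. The GSM7 set below is the GSM 03.38 basic alphabet plus its extension table, and the sketch knowingly ignores that extended-table characters cost two septets each in a real encoder:

```python
import math

# GSM 03.38 basic alphabet plus the extension table (^ { } \ [ ~ ] | €).
GSM7 = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà"
    "^{}\\[~]|€"
)

def sms_segments(text: str) -> tuple[str, int]:
    """Return (encoding, segment count); extended chars are counted as one septet."""
    if all(ch in GSM7 for ch in text):
        encoding, limit, multi = "GSM-7", 160, 153
    else:
        encoding, limit, multi = "UCS-2", 70, 67
    n = len(text)
    parts = 1 if n <= limit else math.ceil(n / multi)
    return encoding, parts

sms_segments("Hello")     # ("GSM-7", 1)
sms_segments("Hello 🎉")  # ("UCS-2", 1) - one emoji flips the whole message
```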
Python 3: len() Counts Code Points
Python 3 stores strings internally as arrays of code points (using Latin-1, UCS-2, or UCS-4
depending on the content). len() returns the number of code points. This is predictable
for most text but still different from grapheme clusters when dealing with composed characters.
One subtle trap: the same visible character can have a different code point count depending on
whether it uses precomposed or decomposed Unicode normalization.
# Python 3
s = "e\u0301" # 'e' + combining acute accent (NFD form of 'é')
len(s) # 2 - two code points
import unicodedata
len(unicodedata.normalize("NFC", s)) # 1 - composed form

JavaScript: .length Counts UTF-16 Code Units
JavaScript's .length is the most common source of off-by-one errors when handling emoji or supplementary-plane characters. Spreading the string into an array (or using Array.from()) gives a code point count, and the Intl.Segmenter API gives a grapheme cluster count. Any validation that enforces a character limit using .length directly will reject strings that are actually under the limit, or worse, silently accept strings that are over it, depending on where the emoji falls.
// Safer character counting in JavaScript
function codePointCount(str) {
return [...str].length;
}
function graphemeCount(str) {
return [...new Intl.Segmenter().segment(str)].length;
}
codePointCount("café 🎉") // 6
graphemeCount("café 🎉") // 6
"café 🎉".length // 7

MySQL: VARCHAR(n) Depends on the Character Set
In MySQL, VARCHAR(255) means 255 characters, but "character" is defined by the column's
character set. With utf8mb4 (which supports full Unicode including emoji), MySQL allocates
up to four bytes per character but still counts VARCHAR in characters, not bytes. However,
utf8 (the misnamed old charset that only covers BMP characters) will reject emoji entirely
because it caps at three bytes per character. PostgreSQL's varchar(n) consistently counts Unicode code points.
The practical consequence: migrating a MySQL database from the utf8 charset to utf8mb4 is not just a storage change; it changes which strings the column will
accept. A VARCHAR(255) column in utf8 that contains 255 CJK characters uses 765
bytes. The same column in utf8mb4 uses up to 1020 bytes, which can exceed the row
size limit for InnoDB tables with many columns.
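The byte arithmetic is easy to verify from Python, using "漢" as a representative 3-byte CJK character. This is just illustrative arithmetic, not a MySQL API:

```python
cjk = "漢" * 255                    # 255 CJK code points: fits VARCHAR(255)
print(len(cjk.encode("utf-8")))    # 765 bytes - 3 bytes per character
emoji = "🎉" * 255                  # also fits VARCHAR(255) under utf8mb4
print(len(emoji.encode("utf-8")))  # 1020 bytes - utf8mb4's 4-byte worst case
```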
The Emoji Problem in Detail
Emoji create the most confusion because their "size" varies across all four counting methods. A simple emoji like 🎉 (U+1F389) is:
- 4 bytes in UTF-8
- 2 code units in UTF-16 (a surrogate pair)
- 1 code point in Unicode
- 1 grapheme cluster
A family emoji like 👨‍👩‍👧 is:
- 18 bytes in UTF-8 (three 4-byte emoji plus two 3-byte ZWJs)
- 8 code units in UTF-16
- 5 code points (3 emoji + 2 ZWJ)
- 1 grapheme cluster
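Three of these four counts fall straight out of Python's standard library; grapheme clusters need a Unicode segmentation library (the third-party `grapheme` package is one option, mentioned here as an assumption rather than a requirement):

```python
fam = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # 👨‍👩‍👧 built explicitly

print(len(fam.encode("utf-8")))           # 18 UTF-8 bytes
print(len(fam.encode("utf-16-le")) // 2)  # 8 UTF-16 code units
print(len(fam))                           # 5 code points
# Grapheme clusters (1 here) require a segmentation library,
# e.g. grapheme.length(fam) with the third-party `grapheme` package.
```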
When a platform says an emoji "counts as 2 characters," it is counting UTF-16 code units. When it says "1 character," it is using code points or grapheme clusters. Neither answer is wrong in isolation, but the mismatch between what you see on screen (one character) and what the API enforces (two code units) is the source of countless "why is my post over the limit" support tickets.
Skin-tone modifiers add another layer. The emoji 👍🏽 (thumbs up with medium skin tone) is two code points, the base thumbs-up emoji U+1F44D plus the skin tone modifier U+1F3FD, but renders as one visible glyph. That means it is 2 code points, 4 UTF-16 code units, 8 UTF-8 bytes, and 1 grapheme cluster. A character counter that reports all four values simultaneously makes these distinctions immediately visible instead of requiring you to work through the math by hand.
Unicode Normalization Changes the Count
A single character like "é" can be represented in two ways: as a precomposed character (U+00E9, NFC form) or as a base character plus a combining accent (U+0065 U+0301, NFD form). Both look identical on screen, but one has one code point and the other has two.
import unicodedata
nfc = unicodedata.normalize("NFC", "é") # U+00E9
nfd = unicodedata.normalize("NFD", "é") # U+0065 U+0301
len(nfc) # 1
len(nfd) # 2
nfc == nfd # False (!) despite looking the same

This affects string equality, sort order, and character counting. Text typed into a browser usually arrives in NFC because most keyboards and input methods emit precomposed characters, but text pasted from different operating systems can arrive in either form. macOS tends to use NFD for filenames; Windows and Linux use NFC. If your application receives filenames through an upload form and stores them in a database, you may end up with two rows that look identical in the UI but compare as different strings because one was uploaded from a Mac.
The Unicode Consortium recommends normalizing to NFC before storing or comparing strings.
Most language standard libraries provide a normalization function: Python's unicodedata.normalize(), JavaScript's String.prototype.normalize(), Ruby's String#unicode_normalize.
Using it consistently at the data entry boundary prevents a whole class of subtle equality
bugs.
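A minimal sketch of that boundary rule, normalizing once on the way in (the function name is purely illustrative):

```python
import unicodedata

def canonicalize(s: str) -> str:
    """Normalize user input to NFC before storing or comparing it."""
    return unicodedata.normalize("NFC", s)

canonicalize("e\u0301") == canonicalize("\u00e9")  # True: both become U+00E9
```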
What You Should Actually Count
The right unit depends on what you are counting for:
- Storage size: Count UTF-8 bytes. Use Buffer.byteLength(str, 'utf8') in Node.js or len(s.encode('utf-8')) in Python.
- Social media limits: Use code points (most platforms, including Twitter for non-URL text). Verify with each platform's API documentation.
- SMS limits: Check whether your message contains non-GSM-7 characters first, then count accordingly.
- UI display space: Count grapheme clusters โ this matches what the user sees on screen.
- Database limits: Read your column definition and character set. VARCHAR(n) in utf8mb4 MySQL counts characters (code points); in a binary collation it counts bytes.
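A side-by-side report along these lines can be sketched in a few lines of Python (grapheme clusters are omitted because they need a third-party segmentation library):

```python
def all_counts(s: str) -> dict[str, int]:
    """Report a string's size under three of the four counting schemes."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "code_points": len(s),
    }

all_counts("café 🎉")  # {'utf8_bytes': 10, 'utf16_units': 7, 'code_points': 6}
```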
A dedicated tool that reports all four values simultaneously removes the guesswork. Paste your text and check the exact character count: comparing the byte count, code unit count, code point count, and grapheme cluster count side by side immediately tells you which limits your text falls within and which it violates.
Understanding why character counting matters across platforms is the broader context for these discrepancies: each platform defines its own unit of measurement, and those definitions are rarely documented prominently enough for developers to find before hitting a limit in production. The patterns are consistent once you know what to look for, but discovering them after a production incident is a frustrating way to learn.