You copy text from a document, a website, or a chat message — and paste it into your code, a spreadsheet, or a search field. Everything looks normal. But the string comparison fails. The search returns no results. The CSV row splits at the wrong column. Something invisible crept in with your clipboard, and now your data is silently broken.
Unicode contains dozens of characters specifically designed to be invisible or to control text flow. Most of the time they serve a legitimate purpose — supporting right-to-left scripts, controlling how emoji combine, marking word boundaries in languages without spaces. But when these characters escape into contexts that don't expect them, they cause hard-to-diagnose bugs. Here are the ten offenders you're most likely to encounter and exactly what they do to your data.
The 10 Hidden Characters
1. Zero-Width Space — U+200B
What it is: A space character with zero width. It is completely invisible in rendered text but present as a real code point in the string.
Where it comes from: Content management systems, word processors, and websites that want to suggest line-break opportunities inside long words or URLs without actually inserting a visible space.
What it breaks: String equality, word counting, search indexing, and any regular expression that matches word boundaries. Two strings that look identical in the UI can differ because one contains a U+200B in the middle.
"helloworld" === "helloworld" // false — invisible U+200B after 'o'
"helloworld".length // 11, not 10 2. Zero-Width Non-Joiner — U+200C
What it is: Instructs a rendering engine not to join adjacent characters that would normally connect (for example, cursive ligatures in Arabic or Devanagari).
Where it comes from: Persian and Arabic text editors insert it routinely to control script rendering. It leaks into English copy when editors paste multilingual content without cleaning it first.
What it breaks: Token parsing, slug generation, and any system that splits on whitespace. Because U+200C is not whitespace, it will end up embedded inside apparent words.
slugify("designer\u200C tips")
// expected: "designer-tips"
// actual: "designer\u200C-tips" (ZWNJ survives the slug)
3. Zero-Width Joiner — U+200D
What it is: The opposite of U+200C. It asks the renderer to join adjacent characters. It is the glue that combines emoji into sequences: the family emoji 👨‍👩‍👧 is actually three separate emoji (👨, 👩, 👧) joined by two U+200D characters.
Where it comes from: Any source of emoji, especially copy-pasted from mobile keyboards or social media.
What it breaks: When a stray U+200D appears outside an intentional emoji sequence, it joins characters that were never meant to connect, producing garbled glyphs in some rendering engines and unexpected string lengths everywhere.
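A sketch in Python makes the joining mechanics concrete, since each emoji and each joiner is a separate code point:

```python
# The family emoji is three code points glued by two U+200D joiners.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # renders as one glyph

print(len(family))  # 5: three emoji plus two zero-width joiners
print([hex(ord(ch)) for ch in family])
# ['0x1f468', '0x200d', '0x1f469', '0x200d', '0x1f467']
```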
"cafe\u200D".length // 5, not 4
Array.from("cafe\u200D") // ["c","a","f","e","\u200D"] 4. Soft Hyphen — U+00AD
What it is: A hyphenation hint. Browsers render it as invisible unless the word breaks at that exact position, in which case a visible hyphen appears at the line end.
Where it comes from: Automated hyphenation tools, some word processors (especially when exporting to HTML), and older desktop publishing systems.
What it breaks: Database storage and retrieval, full-text search, and clipboard comparisons. A word stored as "pres\u00ADsure" will not match the search query "pressure", because the soft hyphen is a real code point even though the two strings render identically.
"pres\u00ADsure" === "pressure" // false
"pres\u00ADsure".length // 9, not 8
5. Non-Breaking Space — U+00A0
What it is: Looks exactly like a regular space (U+0020) but prevents a line break between the two words it separates. Widely used in typography to keep units with their numbers: 42 kg, § 4.
Where it comes from: Word processors, Google Docs, Wikipedia, and any professionally typeset web page. It is one of the most common invisible character problems because it is so frequently and intentionally used.
What it breaks: Code that splits on a literal space character misses U+00A0 entirely. SQL TRIM() does not strip it in most databases. Python does treat it as whitespace (both str.split() and str.strip() handle it), but splitting on the literal " " does not, and neither does string equality.
# Python
"hello\u00A0world".split() # ['hello', 'world'] ✓ — NBSP counts as whitespace
"hello\u00A0world".split(" ") # ['hello\xa0world'] ✗ — literal-space split misses it
// JavaScript
/^\s+$/.test("\u00A0") // true — \s matches NBSP per the ECMAScript spec
"\u00A0" === " " // false
6. Left-to-Right Mark and Right-to-Left Mark — U+200E / U+200F
What they are: Invisible directional controls for the Unicode Bidirectional Algorithm. LRM (U+200E) pushes the surrounding text into left-to-right rendering; RLM (U+200F) does the opposite. Neither has width.
Where they come from: Any application that handles mixed-direction text — Hebrew or Arabic mixed with English, spreadsheets with RTL locales, customer-facing platforms localized for Middle Eastern markets.
What they break: String comparisons, log parsing, and any tool that processes text without stripping bidi control characters first. They are especially sneaky because they render as nothing in almost every editor, and even in a hex dump they appear only as an unremarkable byte sequence (E2 80 8E for U+200E in UTF-8) that is easy to scroll past.
const a = "status\u200E";
const b = "status";
a === b // false
a.length // 7
7. Word Joiner — U+2060
What it is: Functionally similar to U+200B but with the opposite intent: it prevents a line break at that position without adding any visible space. It is the modern replacement for U+FEFF in that character's deprecated zero-width no-break space role.
Where it comes from: Desktop publishing and word-processing software that needs to keep certain word pairs on the same line (product names, abbreviations) without altering the visible typography.
What it breaks: Like all zero-width characters, it corrupts string equality checks and increases string lengths unexpectedly. It also breaks tokenizers that split text into words for NLP or search indexing.
"donot".length // 7, not 6
"donot" === "donot" // false 8. Byte Order Mark — U+FEFF
What it is: Originally used at the very start of a UTF-16 or UTF-32 file to indicate byte order (big-endian vs. little-endian). In UTF-8 files it is technically redundant but some Windows tools (Notepad, Excel) still write it.
Where it comes from: Any file saved by a Windows application with "UTF-8 with BOM" encoding. Very common with CSV exports from Excel and text files from legacy Windows toolchains.
What it breaks: Many JSON parsers reject input that begins with a BOM (RFC 8259 forbids implementations from adding one; JSON.parse throws on it). CSV parsers include the BOM in the first column header name, so "id" becomes "\uFEFFid" and all lookups by column name fail silently. Shell scripts that begin with a BOM are not recognized by the kernel's #! handling, so the shebang line is ignored.
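On the reading side, Python's built-in utf-8-sig codec strips a leading BOM if one is present and is harmless if it is absent. A minimal sketch:

```python
import csv
import io

# Raw bytes as Excel's "CSV UTF-8" export writes them: BOM + content
raw = b"\xef\xbb\xbfid,name\n1,Alice\n"

# Decoding as plain utf-8 leaves U+FEFF attached to the first header
bad = next(csv.DictReader(io.StringIO(raw.decode("utf-8"))))
print("id" in bad)  # False: the real key is "\ufeffid"

# utf-8-sig consumes the BOM during decoding
good = next(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))))
print(good["id"])  # 1
```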
// Node.js reading a BOM-prefixed CSV
const firstKey = Object.keys(rows[0])[0];
firstKey === "id" // false — it's "\uFEFFid"
firstKey.charCodeAt(0) // 65279 (0xFEFF)
9. Object Replacement Character — U+FFFC
What it is: A placeholder used in rich text formats to mark the position of an embedded object — an image, a table, an inline attachment — that has no textual representation.
Where it comes from: Rich text editors (Google Docs, LibreOffice, Word) and messaging platforms that support inline file attachments. When you copy a paragraph that contains an embedded image, U+FFFC travels with the text even if the image does not.
What it breaks: Plain-text pipelines, character counters, and any form validation that enforces a maximum length. A comment that looks like 140 characters may actually be 141 because of a stray object replacement character copied from a rich text message.
const text = "See attached \uFFFC for details";
text.length // 26 — one extra invisible character
10. Replacement Character — U+FFFD
What it is: The Unicode standard's official substitute for a byte sequence
that cannot be decoded. When a decoder encounters bytes that do not map to any valid code
point in the target encoding, it replaces them with U+FFFD (rendered as �).
Where it comes from: Encoding mismatches — reading a Latin-1 or Windows-1252 file as UTF-8, or re-encoding a string through multiple steps without specifying the encoding explicitly. Also appears when scraping web pages that report the wrong charset in their headers.
What it breaks: Unlike the other characters on this list, U+FFFD is visible
(as �) if rendered, but it is often ignored because developers assume it
is intentional. In data pipelines it signals lost information: the original bytes are gone.
Storing U+FFFD in a database means the original data cannot be recovered.
// Python — reading a Latin-1 file as UTF-8
text = open("file.txt", encoding="utf-8", errors="replace").read()
# Any invalid bytes become U+FFFD ('\ufffd')
Why These Characters Are So Hard to Find
The fundamental problem is that most text-editing interfaces render these characters as nothing at all. A zero-width space occupies no pixels. A non-breaking space looks identical to a regular space. A byte order mark at the start of a string is invisible unless you explicitly inspect the raw bytes.
Standard developer tools do not help much either. Browser DevTools show the rendered text,
not the code points. Most code editors display these characters as blank unless you
specifically configure them to show invisible characters. The JavaScript console.log() output looks clean. Only a hex dump or a dedicated Unicode inspector
reveals what is really there.
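In Python, that inspection takes one line, no hex editor required: ascii() escapes every non-ASCII code point, and encode() shows the raw bytes.

```python
pasted = "hello\u200bworld"  # what a contaminated clipboard might hold

print(pasted)                  # looks like "helloworld" on screen
print(ascii(pasted))           # 'hello\u200bworld': the ZWSP is exposed
print(pasted.encode("utf-8"))  # b'hello\xe2\x80\x8bworld'
```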
The result is a category of bugs that are genuinely difficult to reproduce from a bug report, because the bug only exists in the specific string that was pasted — a string that looks completely normal to anyone reading it.
How to Detect Hidden Characters
The most reliable way to check a string for hidden characters is to inspect it at the code
point level. In JavaScript, Array.from(str) splits a string by Unicode code points
(correctly handling surrogate pairs), and you can filter for anything outside the printable ASCII
range:
const suspicious = Array.from(str).filter(ch => {
  const cp = ch.codePointAt(0);
  return cp < 0x20 || (cp >= 0x7F && cp <= 0x9F)
    || cp === 0x00AD || cp === 0x200B || cp === 0x200C
    || cp === 0x200D || cp === 0x200E || cp === 0x200F
    || cp === 0x2060 || cp === 0xFEFF || cp === 0xFFFC
    || cp === 0xFFFD;
});
// e.g. "U+200B", "U+FEFF"
console.log(suspicious.map(ch => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")));
In Python, you can use the unicodedata module to inspect the category of each character:
import unicodedata
def find_hidden(text):
    return [
        (i, unicodedata.name(ch, 'UNKNOWN'), hex(ord(ch)))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) in ('Cf', 'Cc', 'Cs')
    ]
The category Cf covers format characters (zero-width spaces, bidi marks, soft hyphens), Cc covers control characters, and Cs covers surrogates. Together they catch most of the invisible troublemakers on this list, but not all of them: U+00A0 is category Zs (space separator), and U+FFFC and U+FFFD are category So (other symbol), so check for those three explicitly.
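A quick sketch showing where each character on this list actually falls, including the three that a Cf/Cc/Cs filter would miss:

```python
import unicodedata

for cp in (0x200B, 0x200C, 0x200D, 0x00AD, 0x00A0, 0x200E,
           0x200F, 0x2060, 0xFEFF, 0xFFFC, 0xFFFD):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.category(ch),
          unicodedata.name(ch, "UNKNOWN"))
# U+00A0 reports Zs, and U+FFFC / U+FFFD report So, so a filter on
# Cf/Cc/Cs alone will not flag them.
```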
How to Remove Them
Once detected, removal is straightforward. A targeted regular expression handles the most common offenders:
// JavaScript — remove common invisible characters
const cleaned = str.replace(
/[\u00AD\u200B-\u200F\u2060\uFEFF\uFFFC\uFFFD]/g,
''
);
// Replace NBSP with a regular space instead of deleting it
const normalized = cleaned.replace(/\u00A0/g, ' ');
For production use, however, writing and maintaining a comprehensive regex is tedious and error-prone — the Unicode standard adds new format characters across versions, and edge cases like partial emoji sequences require careful handling.
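The same cleanup in Python, as a sketch that mirrors the regex above, can use str.translate to delete the listed code points in one pass:

```python
# Code points to delete outright (the same set as the regex above)
DELETE = [0x00AD, *range(0x200B, 0x2010), 0x2060, 0xFEFF, 0xFFFC, 0xFFFD]
TABLE = dict.fromkeys(DELETE)  # mapping a code point to None deletes it
TABLE[0x00A0] = " "            # NBSP becomes a regular space instead

def clean(text: str) -> str:
    return text.translate(TABLE)

print(clean("hello\u200bworld"))  # helloworld
print(clean("42\u00a0kg"))        # 42 kg
```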
A dedicated tool is a better choice for one-off cleaning tasks, especially when working with text from untrusted sources like user submissions, scraped content, or document exports. The Smart Text Cleaner handles all ten characters described in this article, normalizes whitespace, and gives you a preview of exactly what was removed before you commit to the change.
Prevention Is Better Than Cleaning
For long-term reliability, it is worth building invisible-character stripping into the data entry layer rather than cleaning after the fact. Sanitize user-submitted text at the API boundary before it is stored. Add a lint rule or pre-commit hook that fails when source files contain zero-width characters. Configure your editor to render invisible characters visibly — VS Code's "Render Whitespace" setting covers some of these, and extensions like Highlight Bad Chars cover the rest.
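A minimal sketch of such a check; the character set and the reporting format are assumptions, not a standard tool:

```python
import re

# Flag zero-width and bidi control characters in source text.
HIDDEN = re.compile(r"[\u00AD\u200B-\u200F\u2060\uFEFF\uFFFC]")

def find_violations(text: str):
    """Return (line, column, code point) for every hidden character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for m in HIDDEN.finditer(line):
            hits.append((lineno, m.start(), f"U+{ord(m.group()):04X}"))
    return hits

source = 'greeting = "hello\u200bworld"\n'
print(find_violations(source))  # [(1, 17, 'U+200B')]
```

A pre-commit hook would run find_violations over each staged file and exit non-zero whenever the list is non-empty.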
When your team regularly copies content from rich-text sources — Notion, Google Docs, Confluence, Slack — a short paste-cleanup habit saves significant debugging time. Paste into a plain-text intermediary first, or use a tool that strips formatting on paste. The ten characters described here are the most common culprits, but the habit of treating clipboard content as untrusted input is the real lesson.