You copy text from a document, a website, or a chat message — and paste it into your code, a spreadsheet, or a search field. Everything looks normal. But the string comparison fails. The search returns no results. The CSV row splits at the wrong column. Something invisible crept in with your clipboard, and now your data is silently broken.
Unicode contains dozens of characters specifically designed to be invisible or to control text flow. Most of the time they serve a legitimate purpose — supporting right-to-left scripts, controlling how emoji combine, marking word boundaries in languages without spaces. But when these characters escape into contexts that don't expect them, they cause hard-to-diagnose bugs. Here are the ten offenders you're most likely to encounter and exactly what they do to your data.
The 10 Hidden Characters
1. Zero-Width Space — U+200B
What it is: A space character with zero width. It is completely invisible in rendered text but present as a real code point in the string.
Where it comes from: Content management systems, word processors, and websites that want to suggest line-break opportunities inside long words or URLs without actually inserting a visible space.
What it breaks: String equality, word counting, search indexing, and any regular expression that matches word boundaries. Two strings that look identical in the UI can differ because one contains a U+200B in the middle.
"helloworld" === "helloworld" // false — invisible U+200B after 'o'
"helloworld".length // 11, not 10 2. Zero-Width Non-Joiner — U+200C
What it is: Instructs a rendering engine not to join adjacent characters that would normally connect (for example, cursive ligatures in Arabic or Devanagari).
Where it comes from: Persian and Arabic text editors insert it routinely to control script rendering. It leaks into English copy when editors paste multilingual content without cleaning it first.
What it breaks: Token parsing, slug generation, and any system that splits on whitespace. Because U+200C is not whitespace, it will end up embedded inside apparent words.
slugify("designer\u200C tips")
// expected: "designer-tips"
// actual: "designer\u200C-tips" (ZWNJ survives the slug)
3. Zero-Width Joiner — U+200D
What it is: The opposite of U+200C. It asks the renderer to join adjacent characters. It is the glue that combines emoji into sequences: the family emoji 👨‍👩‍👧 is actually three separate emoji (👨, 👩, 👧) joined by two U+200D characters.
Where it comes from: Any source of emoji, especially copy-pasted from mobile keyboards or social media.
What it breaks: When a stray U+200D appears outside an intentional emoji sequence, it joins characters that were never meant to connect, producing garbled glyphs in some rendering engines and unexpected string lengths everywhere.
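A sketch in Python makes the joining mechanics concrete, since each emoji and each joiner is a separate code point:

```python
# The family emoji is three code points glued by two U+200D joiners.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # renders as one glyph

print(len(family))  # 5: three emoji plus two zero-width joiners
print([hex(ord(ch)) for ch in family])
# ['0x1f468', '0x200d', '0x1f469', '0x200d', '0x1f467']
```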
"cafe\u200D".length // 5, not 4
Array.from("cafe\u200D") // ["c","a","f","e","\u200D"] 4. Soft Hyphen — U+00AD
What it is: A hyphenation hint. Browsers render it as invisible unless the word breaks at that exact position, in which case a visible hyphen appears at the line end.
Where it comes from: Automated hyphenation tools, some word processors (especially when exporting to HTML), and older desktop publishing systems.
What it breaks: Database storage and retrieval, full-text search, and clipboard comparisons. A word stored as "pres\u00ADsure" will not match the search query "pressure", because the soft hyphen is a real code point even though the two strings render identically.
"pres\u00ADsure" === "pressure" // false
"pres\u00ADsure".length // 9, not 8
5. Non-Breaking Space — U+00A0
What it is: Looks exactly like a regular space (U+0020) but prevents a line break between the two words it separates. Widely used in typography to keep units with their numbers: 42 kg, § 4.
Where it comes from: Word processors, Google Docs, Wikipedia, and any professionally typeset web page. It is one of the most common invisible character problems because it is so frequently and intentionally used.
What it breaks: Code that splits on a literal space character misses U+00A0 entirely. SQL TRIM() does not strip it in most databases. Python does treat it as whitespace (both str.split() and str.strip() handle it), but splitting on the literal " " does not, and neither does string equality.
# Python
"hello\u00A0world".split() # ['hello', 'world'] ✓ — NBSP counts as whitespace
"hello\u00A0world".split(" ") # ['hello\xa0world'] ✗ — literal-space split misses it
// JavaScript
/^\s+$/.test("\u00A0") // true — \s matches NBSP per the ECMAScript spec
"\u00A0" === " " // false
6. Left-to-Right Mark and Right-to-Left Mark — U+200E / U+200F
What they are: Invisible directional controls for the Unicode Bidirectional Algorithm. LRM (U+200E) pushes the surrounding text into left-to-right rendering; RLM (U+200F) does the opposite. Neither has width.
Where they come from: Any application that handles mixed-direction text — Hebrew or Arabic mixed with English, spreadsheets with RTL locales, customer-facing platforms localized for Middle Eastern markets.
What they break: String comparisons, log parsing, and any tool that processes text without stripping bidi control characters first. They are especially sneaky because they render as nothing in almost every editor, and even in a hex dump they appear only as an unremarkable byte sequence (E2 80 8E for U+200E in UTF-8) that is easy to scroll past.
const a = "status\u200E";
const b = "status";
a === b // false
a.length // 7
7. Word Joiner — U+2060
What it is: Functionally similar to U+200B but with the opposite intent: it prevents a line break at that position without adding any visible space. It is the modern replacement for U+FEFF in that character's deprecated zero-width no-break space role.
Where it comes from: Desktop publishing and word-processing software that needs to keep certain word pairs on the same line (product names, abbreviations) without altering the visible typography.
What it breaks: Like all zero-width characters, it corrupts string equality checks and increases string lengths unexpectedly. It also breaks tokenizers that split text into words for NLP or search indexing.
"donot".length // 7, not 6
"donot" === "donot" // false 8. Byte Order Mark — U+FEFF
What it is: Originally used at the very start of a UTF-16 or UTF-32 file to indicate byte order (big-endian vs. little-endian). In UTF-8 files it is technically redundant but some Windows tools (Notepad, Excel) still write it.
Where it comes from: Any file saved by a Windows application with "UTF-8 with BOM" encoding. Very common with CSV exports from Excel and text files from legacy Windows toolchains.
What it breaks: Many JSON parsers reject input that begins with a BOM (RFC 8259 forbids implementations from adding one; JSON.parse throws on it). CSV parsers include the BOM in the first column header name, so "id" becomes "\uFEFFid" and all lookups by column name fail silently. Shell scripts that begin with a BOM are not recognized by the kernel's #! handling, so the shebang line is ignored.
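On the reading side, Python's built-in utf-8-sig codec strips a leading BOM if one is present and is harmless if it is absent. A minimal sketch:

```python
import csv
import io

# Raw bytes as Excel's "CSV UTF-8" export writes them: BOM + content
raw = b"\xef\xbb\xbfid,name\n1,Alice\n"

# Decoding as plain utf-8 leaves U+FEFF attached to the first header
bad = next(csv.DictReader(io.StringIO(raw.decode("utf-8"))))
print("id" in bad)  # False: the real key is "\ufeffid"

# utf-8-sig consumes the BOM during decoding
good = next(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))))
print(good["id"])  # 1
```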
// Node.js reading a BOM-prefixed CSV
const firstKey = Object.keys(rows[0])[0];
firstKey === "id" // false — it's "\uFEFFid"
firstKey.charCodeAt(0) // 65279 (0xFEFF)
9. Object Replacement Character — U+FFFC
What it is: A placeholder used in rich text formats to mark the position of an embedded object — an image, a table, an inline attachment — that has no textual representation.
Where it comes from: Rich text editors (Google Docs, LibreOffice, Word) and messaging platforms that support inline file attachments. When you copy a paragraph that contains an embedded image, U+FFFC travels with the text even if the image does not.
What it breaks: Plain-text pipelines, character counters, and any form validation that enforces a maximum length. A comment that looks like 140 characters may actually be 141 because of a stray object replacement character copied from a rich text message.
const text = "See attached \uFFFC for details";
text.length // 26 — one extra invisible character
10. Replacement Character — U+FFFD
What it is: The Unicode standard's official substitute for a byte sequence
that cannot be decoded. When a decoder encounters bytes that do not map to any valid code
point in the target encoding, it replaces them with U+FFFD (rendered as �).
Where it comes from: Encoding mismatches — reading a Latin-1 or Windows-1252 file as UTF-8, or re-encoding a string through multiple steps without specifying the encoding explicitly. Also appears when scraping web pages that report the wrong charset in their headers.
What it breaks: Unlike the other characters on this list, U+FFFD is visible
(as �) if rendered, but it is often ignored because developers assume it
is intentional. In data pipelines it signals lost information: the original bytes are gone.
Storing U+FFFD in a database means the original data cannot be recovered.
// Python — reading a Latin-1 file as UTF-8
text = open("file.txt", encoding="utf-8", errors="replace").read()
# Any invalid bytes become U+FFFD ('\ufffd')
Why These Characters Are So Hard to Find
The fundamental problem is that most text-editing interfaces render these characters as nothing at all. A zero-width space occupies no pixels. A non-breaking space looks identical to a regular space. A byte order mark at the start of a string is invisible unless you explicitly inspect the raw bytes.
Standard developer tools do not help much either. Browser DevTools show the rendered text,
not the code points. Most code editors display these characters as blank unless you
specifically configure them to show invisible characters. The JavaScript console.log() output looks clean. Only a hex dump or a dedicated Unicode inspector
reveals what is really there.
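In Python, that inspection takes one line, no hex editor required: ascii() escapes every non-ASCII code point, and encode() shows the raw bytes.

```python
pasted = "hello\u200bworld"  # what a contaminated clipboard might hold

print(pasted)                  # looks like "helloworld" on screen
print(ascii(pasted))           # 'hello\u200bworld': the ZWSP is exposed
print(pasted.encode("utf-8"))  # b'hello\xe2\x80\x8bworld'
```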
The result is a category of bugs that are genuinely difficult to reproduce from a bug report, because the bug only exists in the specific string that was pasted — a string that looks completely normal to anyone reading it.
How to Detect Hidden Characters
The most reliable way to check a string for hidden characters is to inspect it at the code
point level. In JavaScript, Array.from(str) splits a string by Unicode code points
(correctly handling surrogate pairs), and you can filter for anything outside the printable ASCII
range:
const suspicious = Array.from(str).filter(ch => {
  const cp = ch.codePointAt(0);
  return cp < 0x20 || (cp >= 0x7F && cp <= 0x9F)
    || cp === 0x00AD || cp === 0x200B || cp === 0x200C
    || cp === 0x200D || cp === 0x200E || cp === 0x200F
    || cp === 0x2060 || cp === 0xFEFF || cp === 0xFFFC
    || cp === 0xFFFD;
});
// e.g. "U+200B", "U+FEFF"
console.log(suspicious.map(ch => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")));
In Python, you can use the unicodedata module to inspect the category of each character:
import unicodedata
def find_hidden(text):
    return [
        (i, unicodedata.name(ch, 'UNKNOWN'), hex(ord(ch)))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) in ('Cf', 'Cc', 'Cs')
    ]
The category Cf covers format characters (zero-width spaces, bidi marks, soft hyphens), Cc covers control characters, and Cs covers surrogates. Together they catch most of the invisible troublemakers on this list, but not all of them: U+00A0 is category Zs (space separator), and U+FFFC and U+FFFD are category So (other symbol), so check for those three explicitly.
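A quick sketch showing where each character on this list actually falls, including the three that a Cf/Cc/Cs filter would miss:

```python
import unicodedata

for cp in (0x200B, 0x200C, 0x200D, 0x00AD, 0x00A0, 0x200E,
           0x200F, 0x2060, 0xFEFF, 0xFFFC, 0xFFFD):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.category(ch),
          unicodedata.name(ch, "UNKNOWN"))
# U+00A0 reports Zs, and U+FFFC / U+FFFD report So, so a filter on
# Cf/Cc/Cs alone will not flag them.
```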
How to Remove Them
Once detected, removal is straightforward. A targeted regular expression handles the most common offenders:
// JavaScript — remove common invisible characters
const cleaned = str.replace(
/[\u00AD\u200B-\u200F\u2060\uFEFF\uFFFC\uFFFD]/g,
''
);
// Replace NBSP with a regular space instead of deleting it
const normalized = cleaned.replace(/\u00A0/g, ' ');
For production use, however, writing and maintaining a comprehensive regex is tedious and error-prone — the Unicode standard adds new format characters across versions, and edge cases like partial emoji sequences require careful handling.
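The same cleanup in Python, as a sketch that mirrors the regex above, can use str.translate to delete the listed code points in one pass:

```python
# Code points to delete outright (the same set as the regex above)
DELETE = [0x00AD, *range(0x200B, 0x2010), 0x2060, 0xFEFF, 0xFFFC, 0xFFFD]
TABLE = dict.fromkeys(DELETE)  # mapping a code point to None deletes it
TABLE[0x00A0] = " "            # NBSP becomes a regular space instead

def clean(text: str) -> str:
    return text.translate(TABLE)

print(clean("hello\u200bworld"))  # helloworld
print(clean("42\u00a0kg"))        # 42 kg
```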
A dedicated tool is a better choice for one-off cleaning tasks, especially when working with text from untrusted sources like user submissions, scraped content, or document exports. The Smart Text Cleaner handles all ten characters described in this article, normalizes whitespace, and gives you a preview of exactly what was removed before you commit to the change.
Prevention Is Better Than Cleaning
For long-term reliability, it is worth building invisible-character stripping into the data entry layer rather than cleaning after the fact. Sanitize user-submitted text at the API boundary before it is stored. Add a lint rule or pre-commit hook that fails when source files contain zero-width characters. Configure your editor to render invisible characters visibly — VS Code's "Render Whitespace" setting covers some of these, and extensions like Highlight Bad Chars cover the rest.
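A minimal sketch of such a check; the character set and the reporting format are assumptions, not a standard tool:

```python
import re

# Flag zero-width and bidi control characters in source text.
HIDDEN = re.compile(r"[\u00AD\u200B-\u200F\u2060\uFEFF\uFFFC]")

def find_violations(text: str):
    """Return (line, column, code point) for every hidden character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for m in HIDDEN.finditer(line):
            hits.append((lineno, m.start(), f"U+{ord(m.group()):04X}"))
    return hits

source = 'greeting = "hello\u200bworld"\n'
print(find_violations(source))  # [(1, 17, 'U+200B')]
```

A pre-commit hook would run find_violations over each staged file and exit non-zero whenever the list is non-empty.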
When your team regularly copies content from rich-text sources — Notion, Google Docs, Confluence, Slack — a short paste-cleanup habit saves significant debugging time. Paste into a plain-text intermediary first, or use a tool that strips formatting on paste. The ten characters described here are the most common culprits, but the habit of treating clipboard content as untrusted input is the real lesson.