Language has a shape. The ten most common English words — the, be, to, of, and, a, in, that, have, it — account for over 25% of the tokens in virtually any English text. The next 90 most common words bring that figure to roughly 50%. Below the top 1,000, individual words become so rare that you might encounter a given one only once in a million words of reading.
The corpus data behind these figures comes from massive collections of real-world text — news articles, novels, academic papers, transcribed speech — that linguists call corpora. The numbers they reveal about word frequency are remarkably consistent across corpora and have direct applications in language learning, natural language processing, character counting, SEO, and search engine design.
What Is a Corpus?
A corpus is a large, structured collection of naturally occurring language used to study how language works in practice. Rather than relying on intuitions about what words "should" appear, corpus linguists measure what words actually appear — and how often — across millions or billions of real sentences. Frequency lists derived from corpora reflect the living language as it is used, not as grammarians prescribe it.
COCA: The Corpus of Contemporary American English
COCA (corpus.byu.edu) contains 1 billion words of American English text collected from 1990 to 2019, spanning spoken conversation, fiction, popular magazines, newspapers, and academic journals. It is updated annually and is the most widely cited reference corpus for contemporary American English. COCA frequencies in this article are normalized to occurrences per million words — the unit used throughout.
Google Books Ngram: 500 Years of Published Text
The Google Books Ngram Corpus covers over 8 million digitized books, totaling roughly 800 billion words in its 2020 edition. The English subset alone represents 468 billion words. Unlike COCA, it includes historical text going back to the 16th century, making it useful for tracking how word frequency has changed over time. Its 1-gram through 5-gram datasets are publicly downloadable and frequently used in historical linguistics and digital humanities research.
British National Corpus: 100 Million Words
The BNC (natcorp.ox.ac.uk) contains 100 million words of late-20th-century British English, split roughly 90/10 between written and spoken material. It is smaller than COCA but carefully balanced and widely used in academic linguistics. The BNC frequency ranks differ slightly from COCA, reflecting British versus American vocabulary preferences — for example, "whilst" appears far more often in the BNC than in COCA, and certain modal verbs have notably different relative frequencies across the two corpora.
The 50 Most Frequent English Words
The table below lists the 50 most frequent words in English based on COCA frequency data, expressed as occurrences per million words. The top 50 words account for roughly 40% of all tokens in typical English text. You can run any of your own documents through the frequency analyzer to see how their top-word profiles compare to these corpus baselines.
| Rank | Word | COCA freq/million | Part of Speech |
|---|---|---|---|
| 1 | the | 61,847 | Article |
| 2 | be | 42,937 | Verb |
| 3 | and | 28,572 | Conjunction |
| 4 | of | 27,981 | Preposition |
| 5 | a | 26,734 | Article |
| 6 | in | 22,491 | Preposition |
| 7 | to (prep) | 20,284 | Preposition |
| 8 | have | 14,971 | Verb |
| 9 | to (inf) | 14,816 | Particle |
| 10 | it | 14,527 | Pronoun |
| 11 | I | 13,904 | Pronoun |
| 12 | that | 13,618 | Conjunction |
| 13 | for | 13,011 | Preposition |
| 14 | on | 11,736 | Preposition |
| 15 | with | 10,924 | Preposition |
| 16 | he | 10,622 | Pronoun |
| 17 | as | 10,411 | Conjunction |
| 18 | you | 9,930 | Pronoun |
| 19 | do | 9,723 | Verb |
| 20 | at | 9,500 | Preposition |
| 21 | this | 9,143 | Determiner |
| 22 | but | 8,857 | Conjunction |
| 23 | his | 8,706 | Pronoun |
| 24 | by | 8,434 | Preposition |
| 25 | from | 8,052 | Preposition |
| 26 | they | 7,963 | Pronoun |
| 27 | we | 7,831 | Pronoun |
| 28 | say | 7,634 | Verb |
| 29 | her | 7,423 | Pronoun |
| 30 | she | 7,214 | Pronoun |
| 31 | or | 7,088 | Conjunction |
| 32 | an | 6,854 | Article |
| 33 | will | 6,733 | Modal |
| 34 | my | 6,504 | Pronoun |
| 35 | one | 6,320 | Numeral/Pronoun |
| 36 | all | 6,201 | Determiner |
| 37 | would | 6,095 | Modal |
| 38 | there | 5,934 | Adverb |
| 39 | their | 5,812 | Pronoun |
| 40 | what | 5,698 | Pronoun |
| 41 | so | 5,603 | Adverb |
| 42 | up | 5,489 | Adverb |
| 43 | out | 5,387 | Adverb |
| 44 | if | 5,201 | Conjunction |
| 45 | about | 5,098 | Preposition |
| 46 | who | 4,987 | Pronoun |
| 47 | get | 4,876 | Verb |
| 48 | which | 4,765 | Pronoun |
| 49 | go | 4,654 | Verb |
| 50 | me | 4,543 | Pronoun |
Zipf's Law: The Power Law That Governs Language
The frequency drop-off in the table above is not random. It follows a mathematical regularity discovered by linguist George Kingsley Zipf in the 1930s. Zipf's law states that the frequency of a word is approximately inversely proportional to its rank: the most common word appears roughly twice as often as the second most common word, three times as often as the third, and so on.
frequency(rank) ≈ C / rank^s
where C is a constant and s ≈ 1 for natural language. In practice, the fit is not perfect — the very top words ("the", "be") tend to be slightly more frequent than a pure Zipf distribution predicts, and the tail words tend to be slightly less frequent. But the approximation is close enough to be operationally useful. A simple Python function demonstrates how well it fits the COCA top-10:
```python
def zipf_expected_frequency(rank, top_word_freq=61847, s=1.0):
    """Estimate frequency per million for rank N using Zipf's law."""
    return top_word_freq / (rank ** s)

# Actual COCA top-10 frequencies per million, from the table above
actual = [61847, 42937, 28572, 27981, 26734, 22491, 20284, 14971, 14816, 14527]

# How well does the model fit the COCA top-10?
for rank in range(1, 11):
    predicted = zipf_expected_frequency(rank)
    print(f"Rank {rank:2d}: predicted {predicted:7.0f}, actual {actual[rank - 1]:7d} / million")
```

The practical implication of Zipf's law is that vocabulary coverage scales logarithmically with the number of word types learned. The top 100 words cover approximately 50% of all tokens in typical text; the top 1,000 cover roughly 75%. The remaining words of English — everything below rank 1,000, roughly 171,000 types — cover the final 25%. This asymmetry is why focused vocabulary study of high-frequency words pays such large dividends early in language learning.
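The logarithmic scaling of coverage falls out of the model directly: under a pure Zipf distribution with s = 1, the share of tokens covered by the top N words is the ratio of two harmonic numbers. A minimal sketch, assuming the roughly 171,000-type vocabulary mentioned above:

```python
import math

EULER_GAMMA = 0.5772156649

def harmonic(n):
    """Approximate the n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return math.log(n) + EULER_GAMMA + 1 / (2 * n)

def zipf_coverage(top_n, vocab_size=171_000):
    """Token coverage of the top_n words under a pure Zipf (s = 1) model."""
    return harmonic(top_n) / harmonic(vocab_size)

print(f"Top   100: {zipf_coverage(100):.0%}")   # about 41%
print(f"Top 1,000: {zipf_coverage(1000):.0%}")  # about 59%
```

The pure model predicts somewhat lower coverage (about 41% and 59%) than the empirical 50% and 75% — consistent with the caveat above that the very top words in real corpora are more frequent than a pure Zipf distribution predicts.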
Function Words vs Content Words
Scanning the top-50 table reveals an immediately striking pattern: almost every high-ranking word is a function word — an article (the, a, an), a preposition (of, in, to, with, on, by), a conjunction (and, but, or, that, if), or a pronoun (it, I, he, you, she, they, we). These words are the grammatical scaffolding of English sentences. They appear in virtually every sentence regardless of topic, which is precisely why they accumulate such enormous frequencies in any large corpus.
Content words — nouns, main verbs, adjectives, and adverbs that carry the semantic payload of a sentence — are distributed across a much wider range of frequency ranks. The first unambiguously "content" verb in the COCA list is "say" at rank 28. Common content nouns like "time", "people", "year", and "way" appear somewhere in the 50–200 rank range.
This distinction has two important applications in natural language processing. First, removing function words as "stop words" is a standard preprocessing step in information retrieval and text classification, where content words carry the discriminative signal. Removing the top 100 function words before indexing eliminates roughly half of all tokens while losing almost none of the meaning. Second — and more surprisingly — in authorship attribution and stylometry, function word frequencies are actually more informative than content word frequencies. Because function word usage is largely unconscious and resistant to deliberate manipulation, an author's function-word fingerprint is remarkably stable across documents and genres.
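A function-word fingerprint can be sketched as a vector of normalized marker-word rates. The marker list below is an arbitrary illustration — real stylometric studies typically use dozens to hundreds of function words and a distance measure such as Burrows' Delta:

```python
import re
from collections import Counter

# Illustrative marker set; actual studies use far larger lists
MARKERS = ["the", "of", "and", "to", "in", "that", "but", "with"]

def function_word_profile(text, markers=MARKERS):
    """Per-thousand-token rate of each marker word in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [1000 * counts[m] / len(tokens) for m in markers]
```

Because the rates are normalized per thousand tokens, profiles computed from documents of very different lengths remain directly comparable, which is what makes the fingerprint usable across genres.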
How Frequency Varies by Register
Register refers to the variety of language associated with a particular situation or domain. A corpus frequency list derived from mixed text hides substantial variation across registers. A word frequency analyzer applied to texts from different registers reveals these differences clearly.
Spoken Conversation: Contractions and Discourse Markers
Spoken English has higher frequencies of pronouns (I, you, we), contractions (it's, don't, I'm), and discourse markers (well, so, like, you know) than written English. COCA's spoken subcorpus shows "like" at 2.3 times its written frequency and "I" at 1.8 times its overall frequency. Sentence fragments, disfluencies, and repetition also push up the frequency of short function words relative to written registers.
Academic Writing: Low Function-Word Density, High Technical Vocabulary
Academic text has a lower frequency of first-person pronouns — passive voice constructions reduce "I" and "we" substantially — and a higher frequency of nominalizations, which are nouns derived from verbs (for example, "consideration" instead of "consider", "implementation" instead of "implement"). Words like "however", "therefore", "respectively", and "significant" have frequency ratios 4 to 8 times higher in academic than in spoken subcorpora of COCA.
News Media: Named Entities Dominate the Middle Frequency Band
In news text, proper nouns and named entities cluster heavily in the middle frequency ranks. Words like "president", "government", "official", and "percent" are far more common in news than in fiction or spoken language. The news register also has unusually high frequencies of past-tense forms and reporting verbs ("said", "announced", "confirmed"), reflecting the retrospective nature of news reporting.
Fiction: Character Names as Statistical Anomalies
In fiction corpora, character names create artificial frequency spikes at ranks that would otherwise be occupied by common nouns. A novel featuring a character named "James" can push that name to a top-100 frequency within the novel's own token distribution, even though it is rare in the general corpus. This effect makes fiction subcorpora particularly unsuitable as training data for general-purpose frequency lists — character names inflate the apparent frequency of specific proper nouns while the general vocabulary profile remains broadly similar to other written registers.
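The register comparisons above all reduce to one statistic: a word's normalized frequency in a subcorpus divided by its frequency in a reference corpus. A minimal sketch of that computation, assuming tokenized text as plain lists of lowercase words:

```python
from collections import Counter

def per_million(tokens):
    """Normalized frequency of every word, in occurrences per million tokens."""
    total = len(tokens)
    return {word: 1_000_000 * c / total for word, c in Counter(tokens).items()}

def register_ratio(word, register_tokens, reference_tokens):
    """How many times more frequent `word` is in a register vs the reference."""
    reg = per_million(register_tokens).get(word, 0.0)
    ref = per_million(reference_tokens).get(word)
    if not ref:
        return float("inf") if reg else 0.0
    return reg / ref
```

This is the kind of computation behind figures like "like" appearing at 2.3 times its written frequency in spoken data — the same word counted twice, under two different denominators.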
Practical Applications
Language Learning — Learn the Top 1,000 First
Research on reading comprehension demonstrates that knowing the most frequent 1,000 words of a language provides coverage of about 75% of the tokens in typical texts. Knowing the top 3,000 words brings coverage to approximately 95%. This is the empirical basis for frequency-ordered vocabulary lists in language teaching — the methodology behind resources like the General Service List, the Academic Word List, and frequency-ordered language-learning applications.
The comprehensible input hypothesis (sometimes called the i+1 hypothesis) suggests that learners acquire language most efficiently when they understand approximately 95–98% of the tokens in a text. Frequency data lets teachers and course designers select reading and listening materials appropriate to a learner's current vocabulary level, rather than relying on subjective difficulty judgments. A text with too many words below rank 3,000 will overwhelm a beginner; a text with nothing below rank 500 provides no acquisition opportunity for an advanced learner.
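Selecting materials by vocabulary level amounts to counting what share of a text's tokens fall within the learner's known frequency band. A sketch, where `word_rank` is a hypothetical mapping from word to corpus rank (such as one built from a COCA frequency list):

```python
import re

def known_token_coverage(text, word_rank, known_up_to=3000):
    """Share of tokens whose corpus rank falls within the learner's vocabulary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    known = sum(1 for t in tokens if word_rank.get(t, float("inf")) <= known_up_to)
    return known / len(tokens)
```

Under the 95–98% guideline discussed above, a result much below 0.95 suggests the text sits beyond the learner's comfortable comprehension threshold.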
NLP Preprocessing — Stop Word Lists from Corpus Data
Stop words are typically defined as words appearing in the top 50 to 200 of a frequency list. Most NLP libraries — NLTK, spaCy, scikit-learn — provide built-in stop word lists derived from corpus frequency data. Using corpus frequency data directly, rather than a static built-in list, lets you tune the cutoff to your specific domain. A corpus of legal documents has different high-frequency function words than a corpus of customer support tickets, and a domain-tuned stop list will give better downstream results for classification or topic modeling.
```python
import re
from collections import Counter

def build_stop_words(corpus_text, top_n=100):
    """Treat the top_n most frequent words in a corpus as stop words."""
    words = re.findall(r"\b[a-z]+\b", corpus_text.lower())
    freq = Counter(words)
    return {word for word, _ in freq.most_common(top_n)}
```

To see word frequency data for your own text — which words dominate your writing and how their distribution compares to a general corpus — paste it into the word frequency analyzer and examine the resulting rank-frequency table.
SEO — Long-Tail Queries Follow Zipf's Distribution
Search query frequency follows the same power-law distribution as word frequency. A small number of head queries ("weather", "gmail", "youtube") account for a disproportionate share of total search volume, while the vast majority of queries — each individually rare — collectively account for the "long tail" that makes up over 50% of search traffic. SEO strategies that target long-tail queries rely on this distributional property: because the head is dominated by established players with enormous authority, and because long-tail queries collectively represent more total traffic, targeting the specific, lower-competition terms in the middle and tail of the distribution often yields better return on investment than competing for head terms directly.
The same logic applies to content strategy more broadly. A site that covers a topic deeply — addressing the specific, lower-frequency questions that users ask — accumulates long-tail traffic that eventually exceeds the traffic from a small number of highly competitive head terms. Zipf's law quantifies this intuitively: if the distribution of query frequency follows a power law, the area under the tail of the curve is large relative to the area under the head, even though each individual tail query has far lower volume.
The top 50 words in this post account for roughly 40% of the tokens you just read. Zipf's law is not a curiosity — it is a structural property of language that appears consistently across corpora, languages, and media. Understanding it changes how you approach vocabulary acquisition, text preprocessing, and content strategy.
Running your own text through a word frequency analyzer reveals where your writing sits relative to general corpus frequencies — whether you use function words at typical rates or lean toward an unusually dense or sparse style, and which content words you repeat more than you might expect. The top-50 table provides the baseline; the interesting analysis begins when you compare your own text against it.