What follows is
essentially a set of notes, hastily written in preparation for a proper
introduction to this page. I figured that I'd just go ahead and put the page
up and worry about cleaning up the introduction later. The wordlist is
what is important, not my explanation.
This page was created primarily for students learning English as a second language. Acquiring an adequate vocabulary is one of the most important requirements of mastering a new language. But English contains over 300,000 words. Which ones are the most important to learn? To answer this question, I downloaded 17 text files (listed below) from Project Gutenberg. I combined the text files with a large (450k) file of quotations I had collected over the years of my personal reading/research.
1. | A Personal Record, by Joseph Conrad (1882) |
2. | Heart of Darkness, by Joseph Conrad (1894) |
3. | Democracy and Education, by John Dewey (1916) |
4. | Herland, by Charlotte P. Gilman (1915) |
5. | Looking Backward, 2000 to 1887, by Edward Bellamy (1887) |
6. | McTeague, by Frank Norris (1894) |
7. | My Antonia, by Willa Cather (1918) |
8. | The Age of Innocence, by Edith Wharton (1920) |
9. | The Call of the Wild, by Jack London (1903) |
10. | The Jungle, by Upton Sinclair (1906) |
11. | The Life of Me, by Clarence Johnson (1978) |
12. | The Voyage Out, by Virginia Woolf (1915) |
13. | This Side of Paradise, by F. Scott Fitzgerald (1920) |
14. | Walden, by Henry David Thoreau (1854) |
15. | William Gibson interviewed by Giuseppe Salza (1994) |
16. | John F. Kennedy's Inaugural Address (1961) |
17. | Bill Clinton's Inaugural Address (1993) |
18. | various quotations from my own reading (72,070 words) |
All of this text
was combined into one very large (7,551,435 bytes) text file. I ran that file
through Conc 1.76 (a concordance program from The Summer Institute of Linguistics, Dallas, Texas). The software generated
a list of all the words used along with the number of occurrences of each
word.
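The counting step can be approximated with a short Python sketch. This is not Conc and makes no claim about how Conc works internally; it is a stand-in built on the standard library's `collections.Counter`. The tokenizing pattern `WORD_RE` and the function name `count_words` are my own assumptions, as is the sample sentence.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, optionally with internal apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def count_words(text):
    """Return a Counter mapping each word, as spelled in the text, to its count."""
    return Counter(WORD_RE.findall(text))

# Made-up sample text, not taken from the corpus.
sample = "The dog saw the cat. The cat didn't see the dog."
counts = count_words(sample)

# Print the entries from most to least frequent.
for word, n in counts.most_common():
    print(word, n)
```

Case is deliberately preserved here, so that capitalized entries can still be told apart during the cleanup that follows.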
The raw data contained 41,119 entries and a total of 1,362,482 words. This file was, however, dirty. There was a considerable amount of cleaning up that had to be done in order to make this work useful to students of language. I removed proper nouns (names of people, cities, etc. [anything with capital letters]). I removed foreign words, except for those in common usage in English. Of course, many of the words isolated by the concordance-generating software are actually different forms of the "same" word, for example:
image
imaged imagery images imaginable imaging imagine imagined imagines imagining |
imaginings
imagination imaginations imaginative imaginatively imaginary imagist imagists imagism imagectomy |
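The cleanup described above was done by hand, but the two main steps can be sketched in Python. Everything named here is hypothetical: the `FORM_TO_CONCEPT` table, both function names, and the counts are illustrative only. One assumption goes beyond the text: a capitalized entry whose lowercase form also occurs is treated as a sentence-initial word and folded into the lowercase entry rather than discarded.

```python
from collections import Counter

# Hypothetical hand-built table: each word form points to its concept headword.
FORM_TO_CONCEPT = {
    "image": "image", "images": "image", "imagery": "image",
    "imagine": "image", "imagined": "image", "imagination": "image",
}

def drop_proper_nouns(counts):
    """Drop entries that only ever appear capitalized (rough proper-noun rule)."""
    cleaned = Counter()
    for word, n in counts.items():
        lower = word.lower()
        if word == lower:
            cleaned[word] += n
        elif lower in counts:
            # Assumption: capitalized only because it began a sentence.
            cleaned[lower] += n
        # Otherwise the word never appears in lowercase, so it is dropped.
    return cleaned

def merge_forms(counts, form_to_concept):
    """Add the counts of all forms of a concept into one entry."""
    merged = Counter()
    for word, n in counts.items():
        merged[form_to_concept.get(word, word)] += n
    return merged

# Illustrative counts, not taken from the real corpus.
raw = Counter({"The": 2, "the": 5, "London": 3, "image": 4,
               "images": 3, "imagine": 5, "river": 7})
concepts = merge_forms(drop_proper_nouns(raw), FORM_TO_CONCEPT)
print(concepts.most_common())
```

A lookup table was chosen over an automatic stemmer because the groupings on this page were editorial decisions, and a table keeps each decision visible.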
So it is not actually
words but concepts that this work deals with. Certain
problems arose in the process of combining various word-forms into
their respective concepts. For instance, found is not only the past
tense of find but also the present tense of an altogether different verb
meaning approximately establish. Tear means to rip as well
as the moisture that comes from your eye. Since I don't plan on spending
my entire life straightening out all these complexities of English, I have
placed asterisks after words that encompass more than one concept.
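As a rough illustration of the asterisk convention, the sketch below marks any form that can express more than one concept. The table `FORM_TO_CONCEPTS`, its concept labels, and the function name `mark_ambiguous` are hypothetical; the actual marking on this page was done by hand.

```python
# Hypothetical table: each word form lists every concept it can express.
FORM_TO_CONCEPTS = {
    "find": ["find"],
    "found": ["find", "establish"],
    "tear": ["rip", "teardrop"],
    "river": ["river"],
}

def mark_ambiguous(forms, form_to_concepts):
    """Append '*' to any form that maps to more than one concept."""
    marked = []
    for form in forms:
        # Forms missing from the table are assumed to express one concept.
        concepts = form_to_concepts.get(form, [form])
        marked.append(form + "*" if len(concepts) > 1 else form)
    return marked

marked = mark_ambiguous(["find", "found", "tear", "river"], FORM_TO_CONCEPTS)
print(marked)
```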
While I was combining various word-forms, it became apparent that this process applied to other categories of words as well. It seemed reasonable to combine all forms of the verb to be. Family-relationship words (mother, sister, aunt... ) could easily be combined. The following groups have been established:
1. | pronouns |
2. | to be |
3. | question words |
4. | number words |
5. | family relationship words |
6. | color words |
7. | direction words |
If you click on
the linked words, you will find an analysis (including frequency data)
of the individual words included in each group. It seems like a good idea
to me. If you have any suggestions for other words/concepts that might
be grouped, please contact me.
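A group entry of this kind, one total plus a per-word breakdown, might be computed along the following lines. The membership lists in `GROUPS` are guesses for illustration and may not match the lists actually used on this page; the counts are made up, and `group_totals` is a hypothetical name.

```python
from collections import Counter

# Hypothetical group membership; the page's actual lists may differ.
GROUPS = {
    "to be": ["be", "am", "is", "are", "was", "were", "been", "being"],
    "family relationship words": ["mother", "father", "sister", "brother", "aunt"],
}

def group_totals(counts, groups):
    """Return {group: (total, {member: count})} for members found in counts."""
    result = {}
    for name, members in groups.items():
        # A Counter returns 0 for missing words, so absent members are skipped.
        breakdown = {w: counts[w] for w in members if counts[w] > 0}
        result[name] = (sum(breakdown.values()), breakdown)
    return result

# Illustrative counts, not taken from the real corpus.
counts = Counter({"is": 40, "was": 55, "were": 12,
                  "mother": 9, "aunt": 2, "river": 7})
totals = group_totals(counts, GROUPS)
for name, (total, breakdown) in totals.items():
    print(name, total, breakdown)
```

Keeping the breakdown next to the total means the group can be ranked as one concept while the frequency of each individual word stays available.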
This word/concept analysis is useful for at least two groups of people. Students of English as a Second Language will do well to master the first 3000 words. This is the heart of English. Students of gendo who are learning EarthLing, a debugged subset of wild English designed for clear thinking and accurate communication, will also find the gendo wordlist interesting.