West European Language Bigrams In 2-Dimensional Color

What Are These Graphs?

Welcome to Joseph Schreiner’s web site for illustrating patterns of characters in west European languages.  In particular, I illustrate the patterns and frequencies of bigrams, or 2-character combinations.  On this site I illustrate the most common bigrams found in English, German, French, and Spanish.  Not only do we see the most common bigrams, but we also see which bigrams are most likely to precede or follow each other.  Half of the graphs are color-coded, so that we can see in which language the bigrams are most likely to occur.

This portion of my web site is based on the two Main graphs below – Color and Monochrome.  By clicking on these graphs, you can see the Detail graphs.  When you click on a section of a Main graph, a new window will open with a Detail graph, showing you the magnified section.  As you can see, the Main graphs are quite complex - each pixel has meaning.  The Detail graphs show you the actual bigrams represented by each pixel.

The graphs show us the relationship among bigrams.  They show us how the bigrams precede and follow each other.  Consider the English text "red ball".  The graphs break the text down into a series of bigram pairs, each made of a preceding-bigram and a following-bigram, as in:

Preceding Bigram    Following Bigram
re                  d_
ed                  _b
d_                  ba
_b                  al
ba                  ll

For clarity, the space character has been converted to an underscore.  So we see that the preceding-bigram re is followed by the following-bigram d_.  Likewise, ed is followed by _b, and so on.
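The pairing shown in the table can be reproduced with a few lines of Python. This is only an illustrative sketch (the function name `bigram_pairs` is made up for this example): it slides a 4-character window over the text and splits each window into its first and last two characters.

```python
def bigram_pairs(text):
    """Return (preceding, following) bigram pairs for every 4-character window."""
    s = text.replace(" ", "_")  # show spaces as underscores, as in the table
    return [(s[i:i + 2], s[i + 2:i + 4]) for i in range(len(s) - 3)]


for preceding, following in bigram_pairs("red ball"):
    print(preceding, following)
```

For "red ball" this prints the same five pairs as the table, from re d_ down to ba ll.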

From the patterns of bigrams in these four west European languages (all using the Roman alphabet), I wove the pattern of the Main graphs.

Monochrome Graph

(Black & White)

Click on a section to see it magnified.

Magnified section will open in a separate window.

You may need to make the new window full screen to see all detail.

[Image: Monochrome Main graph]

Color Graph

Click on a section to see it magnified.

Magnified section will open in a separate window.

You may need to make the new window full screen to see all detail.

[Image: Color Main graph]

Structure of the Main Graphs

The rows (A through I) represent preceding-bigrams.  The columns (1 through 9) represent following-bigrams.  re, as a preceding-bigram, is found in row G.  d_, as a following-bigram, is found in column 8.  So we could look at section G8 to see the re-d_ combination, and determine its frequency and the language in which it occurs most often.

I illustrate the 450 most common bigrams.  It is too difficult to display more bigrams, given the constraints of browsers and pixel resolution.  The overall graphs are 450 x 450 pixels, but there is no one-to-one correspondence between bigram and pixel.  More common bigrams are displayed with pixel-lengths greater than 1.00.  Less common bigrams are displayed with pixel-lengths less than 1.00.
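To make the section grid concrete, here is a small Python sketch that maps a pixel position on a Main graph to its section label. It rests on assumptions the text does not state outright: that the nine rows and nine columns are equal in size (450 / 9 = 50 pixels each), that row A is at the top, and that column 1 is at the left. The name `section_at` is hypothetical.

```python
ROWS = "ABCDEFGHI"
GRAPH_PX = 450
SECTION_PX = GRAPH_PX // len(ROWS)  # 50 pixels per section (assumed equal sizes)


def section_at(x, y):
    """Return the section label (such as 'G8') for pixel (x, y), origin at top-left."""
    if not (0 <= x < GRAPH_PX and 0 <= y < GRAPH_PX):
        raise ValueError("pixel lies outside the 450 x 450 graph")
    row = ROWS[y // SECTION_PX]      # rows hold the preceding-bigrams
    column = x // SECTION_PX + 1     # columns hold the following-bigrams
    return f"{row}{column}"


print(section_at(375, 320))
```

Under these assumptions, the pixel at x = 375, y = 320 falls in section G8.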

In the Monochrome graphs, bigram frequency is also represented by brightness.  Black means that this bigram combination did not occur in the sample text.  Bright white means that the bigram combination occurred frequently.  Shades of gray represent intermediate frequencies.  The Color graphs also use brightness and pixel-length to express frequency.

In the Color graphs, the color or hue indicates which language most frequently has this bigram combination:

Language    Color
Spanish     Red
French      Yellow
English     Green
German      Blue

Strong hues indicate that the bigram combination shows a strong preference for one of the languages.  A weak hue, one tending toward white or gray, indicates that the bigram combination occurs nearly equally in all four languages.
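One way to turn these rules into pixel colors is sketched below in Python. This is not the formula used for the graphs, which the text does not give. It is an assumed mapping in HSV space: the hue comes from the language with the highest count, the saturation from how far that language's share exceeds an even one-quarter split, and the brightness from a simple linear scaling of the total count. The names `HUES` and `cell_color` are made up for this example.

```python
import colorsys

# Hue positions on the HSV color wheel for the four languages (0.0 to 1.0).
HUES = {"Spanish": 0.0, "French": 1 / 6, "English": 1 / 3, "German": 2 / 3}


def cell_color(counts, max_total):
    """Return an (R, G, B) tuple for one bigram combination.

    counts: occurrences per language, with one entry for each of the four languages.
    max_total: the largest total count in the table, used to scale brightness.
    """
    total = sum(counts.values())
    if total == 0:
        return (0, 0, 0)  # black: the combination never occurred
    dominant = max(counts, key=counts.get)
    share = counts[dominant] / total
    # An even split (share 0.25) gives saturation 0; a single language gives 1.
    saturation = max(0.0, (share - 0.25) / 0.75)
    value = min(1.0, total / max_total)  # assumed linear brightness scale
    r, g, b = colorsys.hsv_to_rgb(HUES[dominant], saturation, value)
    return (round(r * 255), round(g * 255), round(b * 255))


print(cell_color({"Spanish": 10, "French": 0, "English": 0, "German": 0}, 10))
print(cell_color({"Spanish": 5, "French": 5, "English": 5, "German": 5}, 20))
```

With this mapping, a frequent combination found only in Spanish comes out pure red, while one spread evenly over all four languages comes out white or gray, depending on its frequency.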

I find the graphs fascinating, and I hope that you do also.  The Main graphs give us a bird’s-eye view of the frequencies.  They may be compared to spectroscopes, or to chemical electrophoresis.  But they do not allow us to see information about the individual bigram.  For that, we must examine the Detail graphs.  Even these graphs are complex, but they show enough detail for us to examine individual bigrams.

Structure of the Detail Graphs

When you click on a section of a Main graph, you will see the corresponding Detail graph, which magnifies that cell.  The pixel lengths are multiplied by 12.  The Detail graphs show the same colors and brightness as the Main graphs.  And they show us the bigrams along the axes.

The Detail graphs are just as complex as the Main graphs.  Remember that these graphs are displaying (on average) 50 bigrams along the horizontal and vertical axes.  The bigrams cannot be displayed along a single line.  In order to list them all, the bigrams are displayed in 4 staggered lines.  The example below shows how I do it.  Along the vertical axis (the preceding-bigram) we see the bigrams co, ha, no, pa, so ...  Along the horizontal axis (the following-bigram) we see the bigrams _p, _c, _l, _e, _d ...

[Image G2C: excerpt of a Detail graph, with the axis bigrams staggered over 4 lines]

This is the order in which they occur on the axis.  But I had to place these bigrams on different lines to squeeze them into a manageable space.

How Did I Do This?

For sample text, I used 1.8 megabytes of on-line text (evenly distributed among English, Spanish, French, and German).  60% of the text came from Wikipedia articles (discussing the USA, France, Mexico, Germany, computers, television, religion, the sun, and the moon).  10% came from Yahoo! news articles.  And 30% came from children’s stories, fables, and fairy tales.

I edited the text to remove the square brackets [] of Wikipedia citations, the captions for pictures, and gratuitous line breaks.  I also converted or translated the following characters:

  • Upper case characters were converted to lower case characters.
  • Double-spaces (double-blanks) were converted to single spaces.
  • Spaces became underscores “_”.
  • Numerals were converted to “9”.
  • Punctuation that ends a sentence, or begins a sentence in Spanish, (period, question mark, exclamation point …) was converted to the exclamation point “!”.
  • All other characters were translated to the crosshatch “#”.
  • The beginning and end of paragraphs were converted to "¶¶".
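The conversion rules above can be sketched in Python as follows. The original work was done in Visual Basic, so this is an illustration and not the original code. Several details are assumptions: the exact set of sentence punctuation (here . ? ! ¿ ¡), that all letters, including accented ones, are kept as they are, and that each paragraph is wrapped in a single ¶ so that two adjacent paragraphs are separated by "¶¶". The function names are hypothetical.

```python
import re

SENTENCE_MARKS = ".?!¿¡"  # assumed set of sentence-delimiting punctuation


def normalize_paragraph(paragraph):
    """Apply the character conversions to one paragraph of text."""
    text = paragraph.lower()
    text = re.sub(r" {2,}", " ", text)  # collapse runs of spaces to one space
    out = []
    for ch in text:
        if ch == " ":
            out.append("_")
        elif ch.isdigit():
            out.append("9")
        elif ch in SENTENCE_MARKS:
            out.append("!")
        elif ch.isalpha():
            out.append(ch)   # letters, including accented ones, are kept
        else:
            out.append("#")  # every other character
    return "".join(out)


def normalize_text(paragraphs):
    """Wrap each paragraph in paragraph marks and join them into one string."""
    return "".join("¶" + normalize_paragraph(p) + "¶" for p in paragraphs)


print(normalize_paragraph("Hello,  World 42!"))
print(normalize_text(["¿Qué?", "Red ball."]))
```

Note that the order of the steps matters: double spaces are collapsed before spaces become underscores.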

Using Visual Basic, I scanned all text and found all possible bigrams.  I chose the 450 most frequent bigrams for further analysis.  I created separate 450 x 450 crosstabulation tables for all four languages.
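The counting and crosstabulation steps might look like the following Python sketch. Again, the original was written in Visual Basic, and the function names here are made up. It assumes the text has already been normalized, that bigrams are counted at every character position, and that a bigram pair is only tallied when both of its bigrams are among the selected ones.

```python
from collections import Counter


def top_bigrams(text, n=450):
    """Return the n most frequent bigrams, counted at every character position."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return [bigram for bigram, _ in counts.most_common(n)]


def crosstab(text, bigrams):
    """Count how often each preceding-bigram (row) is followed by each
    following-bigram (column), using 4-character windows."""
    index = {bigram: k for k, bigram in enumerate(bigrams)}
    table = [[0] * len(bigrams) for _ in bigrams]
    for i in range(len(text) - 3):
        row = index.get(text[i:i + 2])
        col = index.get(text[i + 2:i + 4])
        if row is not None and col is not None:
            table[row][col] += 1
    return table


sample = "ababab"
chosen = top_bigrams(sample, n=2)
print(chosen)
print(crosstab(sample, chosen))
```

To get one table per language, the bigram list would be chosen from the combined text and `crosstab` called once for each language's text with that same list.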

The most complex and subtle part of the process was determining the order of the bigrams on the horizontal and vertical axes.  I followed the principle that bigrams with similar response patterns should be next to each other.  For instance, as preceding-bigrams, ra and na are usually followed by the same bigrams.  So these two bigrams are adjacent on the vertical axis, which defines the rows.  As following-bigrams, zu and ko are usually preceded by the same bigrams, so they are adjacent on the horizontal axis, which defines the columns.

The concept of similarity is easy to understand, but difficult to implement in a computational or statistical algorithm.  I tried many customized methods until I settled on the algorithm that I finally used.  If you want to know the details, feel free to send email to me.  Otherwise, I will not dwell on my method here.
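For readers who want a feel for the problem, here is one simple approach in Python. It is explicitly not the algorithm used for these graphs, which is not described here. It is a generic greedy heuristic: measure the similarity of two bigrams as the cosine similarity of their crosstab rows, then build the axis by repeatedly appending the remaining bigram most similar to the last one placed. The row values below are toy numbers invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_order(labels, rows):
    """Order labels so that each one is followed by its most similar remaining neighbor."""
    remaining = list(range(len(labels)))
    order = [remaining.pop(0)]  # start from the first bigram (arbitrary choice)
    while remaining:
        last = rows[order[-1]]
        best = max(remaining, key=lambda k: cosine(last, rows[k]))
        remaining.remove(best)
        order.append(best)
    return [labels[k] for k in order]


# Toy crosstab rows: ra and na have similar profiles, zu does not.
labels = ["ra", "zu", "na"]
rows = [[5, 1, 0], [0, 0, 4], [4, 1, 0]]
print(greedy_order(labels, rows))
```

A greedy pass like this is easy to write but depends on the starting bigram and can leave poor matches at the end of the axis, which hints at why finding a good ordering is difficult.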

Some Results

The Main graphs are rich in detail.  They are similar to fractals, in that they still show much detail even as you look more closely at smaller sections.  I encourage you to examine the Detail graphs.

But let me provide some orientation.  Let us first look at some of the major horizontal stripes, or rows.

Stripe crossing E & F – These are almost all bigrams whose second character is a space, or blank.  (The Detail graphs show the blanks as underscores.)

Blue stripe within I – These are bigrams that tend to occur at the end of German words or syllables, such as eh, tz, or hn.

Red stripe within F – These are bigrams with accented vowels that occur in Spanish, such as án.

Green stripe within A – These are bigrams, mostly with y or w as the second character, which often end English words, such as ly, ow, and sh.

The vertical stripes, or columns, are just as interesting:

Stripe around 2 – These are bigrams where the first character is a space.

Stripe crossing 5 & 6 – These are bigrams where the first character is r, n, or l, and the second character is a consonant.

Red stripe within 7 – These are bigrams where the first character is a or o, and the second character is a space or punctuation (the end of Spanish words or sentences).

And we have some interesting rectangles (the intersection of rows and columns):

Empty E2 & F2 – The text was edited to convert double-spaces to single-spaces.  The E/F stripe is bigrams ending with a space.  Column 2 is bigrams beginning with a space.  So there were virtually no bigram combinations that contain two consecutive spaces.

Sparse A5 through F5 – With a few exceptions, the bigrams in rows A through F end in consonants.  Column 5 (which crosses over into column 6) contains double-consonant bigrams.  So this intersection represents three consecutive consonants, which is rare.

Bright E3 through F5 – Columns 3, 4, and half of 5 contain bigrams where the first character is a consonant, and the second character is a vowel.  This combination often appears at the beginning of a word.  So this rectangle represents:

letter-space-consonant-vowel

which is what we find in the transition from one word into another.

Red G2 & H2 – The G/H stripe is mostly bigrams where the first character is a consonant, and the second character is o or a.  So this rectangle is:

consonant-o/a-space-letter

Many Spanish nouns end in o or a, so this represents the transition from one Spanish word into another.

Yellow H2 – Row H is mostly bigrams where the second character is é, u, or i.  Many French past participles end in é, u, and i.  So this rectangle represents:

consonant-é/u/i-space-letter

the transition from one French word into another.

Any comments?  Questions?  Suggestions?  Please email to:

contact me