Thursday, February 20, 2014

Vocabulary Richness and Joseph Smith's Writings

A Stylometric Analysis of Mormon Scripture and Related Texts
D. I. Holmes

Summary:

Holmes's data best support the following observations:
  1. Joseph Smith's personal writings are in a style completely and objectively different from all of the scriptural works he produced. 
  2. Isaiah is also identifiably different from these works. 
  3. The scriptural works produced by Joseph Smith show evidence of at least 3-4 different authorial styles.
  4. Holmes's conclusions regarding Joseph Smith's 'prophetic voice' are poorly supported by his data.
  5. OR Holmes did a poor job of applying his methods (I don't believe this).
  6. OR Holmes's methods and results are not informative (I don't believe this either, although it is approximately the claim made in the JBMRS reviews).

Data Selection

After reading the reviews in JBMRS, I was prepared to find Holmes's study really poorly done. I was pleasantly surprised to find that his data presentation is clear and professional. I think it is almost essential for a believing Latter-day Saint to skip Holmes's introduction and conclusions if he is to have any hope of examining the data somewhat objectively. Consequently, I'm going to do that. The background on methods is worth noting, but it requires a fairly serious understanding of statistics to make complete sense of it. So I'm going to skip that, too, for the most part. That brings us all the way to what Holmes is studying and what results he presents from that study. I'll begin with Holmes's list of selected texts:
You can see from this that Holmes selected authors who are recognizable to a Mormon audience. He also included three selections from Isaiah, three from the personal writings of Joseph Smith, three from the Doctrine and Covenants, and one from the Book of Abraham. I say 'selections from', but the selections really include the entirety of most of these collections, split up into chunks of approximately 10,000 words. Right off, one might wonder whether some of these should show up as single-author signals at all. When Nephi reported Lehi's words, did he preserve Lehi's style? Who wrote down Alma's words, and did Mormon influence the style? It's pretty clear from the start that author attribution is going to be difficult, but let's see what happens when Holmes applies his relatively objective, statistical tools to these texts. Then we can see if any of our questions are answered.
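To make 'vocabulary richness' a little more concrete, here is a minimal Python sketch of how a text could be split into chunks of roughly 10,000 words and scored. Yule's K and Honoré's R are standard richness measures from the stylometry literature; whether they match Holmes's five variables exactly is an assumption here, and the function names and the toy sentence are invented purely for illustration.

```python
import math
from collections import Counter


def chunk_words(words, size=10000):
    """Split a word list into consecutive chunks of `size` words (the last may be shorter)."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def richness(words):
    """Return a few common vocabulary richness measures for one chunk of text."""
    n = len(words)                       # N: total number of word tokens
    counts = Counter(words)              # how often each distinct word occurs
    v = len(counts)                      # V: number of distinct words (types)
    spectrum = Counter(counts.values())  # V_i: number of types occurring exactly i times
    v1 = spectrum.get(1, 0)              # words used exactly once
    v2 = spectrum.get(2, 0)              # words used exactly twice
    # Yule's K: 10^4 * (sum_i i^2 * V_i - N) / N^2
    yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    # Honore's R: 100 * ln(N) / (1 - V1/V); undefined if every word occurs once
    honore_r = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")
    return {"N": n, "V": v, "V1": v1, "V2": v2, "yule_k": yule_k, "honore_r": honore_r}


# Toy input (invented): a real analysis would use chunks of about 10,000 words.
toy = "the cat saw the dog and the dog saw me".split()
for chunk in chunk_words(toy, size=10000):
    print(richness(chunk))
```

Each chunk ends up as one row of numbers, and it is rows like these that the statistical tools discussed below compare.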

Dendrogram Grouping

Here is the first figure:
I honestly don't know how to interpret this figure. It makes a number of groupings, and accurately separates Joseph Smith and Isaiah from the rest of the selections, but with differences of only a few percent. I don't know what to do with it.

Principal Component analysis of 5 partially independent vocabulary richness variables

Moving on, we get the most important figures of the paper:
These figures are where Holmes takes all of his various results and displays them in an informative, graphical fashion. Holmes has 5 variables measuring vocabulary richness. Some of them give pretty much the same information, and others give different information. By statistically combining them, Holmes is able to separate out important vocabulary richness signals that he calls 'Principal Components' (for mathematical reasons I mostly understand, but that aren't necessary for our evaluation). One thing I should note is that the top graph provides more statistically significant information than the bottom graph, accounting for approximately 77% of the variation among selections. As far as I can judge, there is no reason to doubt Holmes's statistical abilities or the quality of his data. There is further no reason to doubt his professional credentials. So I trust that the data are real. In fact, the BYU statisticians who have criticized his work haven't criticized his results, but only his choice of methods and his conclusions. So I'm going to believe the data, completely. Now what do we see in the data?
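For readers who want a feel for what 'combining' five partly redundant variables means, here is a minimal pure-Python sketch of a principal component analysis. It is not Holmes's computation: the 24 rows are synthetic numbers generated from two made-up hidden factors, the loadings are invented, and power iteration is used only because it needs no extra libraries. The 'explained' fractions it returns are the same kind of quantity as the roughly 77% quoted above.

```python
import math
import random


def standardize(rows):
    """Scale each column to mean 0 and sample standard deviation 1."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1)) for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def correlation_matrix(z):
    """Correlation matrix of rows that have already been standardized."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)] for i in range(p)]


def leading_component(mat, iters=1000):
    """Power iteration: approximate dominant eigenvalue and unit eigenvector."""
    p = len(mat)
    vec = [float(i + 1) for i in range(p)]  # arbitrary non-uniform starting vector
    for _ in range(iters):
        nxt = [sum(mat[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * mat[i][j] * vec[j] for i in range(p) for j in range(p))
    return value, vec


def first_two_components(rows):
    """Scores on the first two principal components and their share of the variance."""
    z = standardize(rows)
    corr = correlation_matrix(z)
    p = len(corr)
    val1, vec1 = leading_component(corr)
    # Deflation: subtract the first component, then find the next one.
    deflated = [[corr[i][j] - val1 * vec1[i] * vec1[j] for j in range(p)] for i in range(p)]
    val2, vec2 = leading_component(deflated)
    scores = [(sum(a * b for a, b in zip(r, vec1)), sum(a * b for a, b in zip(r, vec2)))
              for r in z]
    explained = (val1 / p, val2 / p)  # the trace of a correlation matrix equals p
    return scores, explained


# Synthetic stand-in data: 24 samples x 5 variables driven by two hidden factors.
rng = random.Random(0)
load1 = [1.0, 1.0, 1.0, 0.8, 0.6]    # invented loadings on the first factor
load2 = [0.0, 0.2, -0.2, 0.6, -0.8]  # invented loadings on the second factor
rows = []
for _ in range(24):
    f1, f2 = rng.gauss(0, 1), rng.gauss(0, 1)
    rows.append([a * f1 + b * f2 + 0.3 * rng.gauss(0, 1) for a, b in zip(load1, load2)])

scores, explained = first_two_components(rows)
print([round(x, 2) for x in explained])
```

Plotting the two numbers in each entry of `scores` against each other gives a scatter plot of the same kind as Holmes's figures, one point per text sample.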

Holmes leads the reader to the conclusions I have circled in the figure. Joseph Smith makes one clear group. Isaiah makes another clear group. Everything else makes a third clear group. QED. No need to go any further. All these writings represent 3 authors. Now I'll insert a little of Holmes's historical analysis:
"The overwhelming evidence, therefore, suggests that the Book of Abraham was a product of the mind of Joseph Smith."
The only thing that remains is to conclude that Joseph Smith has a distinct prophetic voice that shows up in all of his 'revelatory' writings.

I do have a few remaining questions. Do any other known authors show two extremely distinct voices in their writings? I've heard they don't, even when they are pretending to write as two different narrators. What does Holmes know about this? He doesn't tell us, anywhere. I know that over time an author's signal can change, but the Book of Mormon, the Doctrine and Covenants, and the Book of Abraham were written over an extended time span, as were Joseph Smith's personal writings, so it seems hard to think that time could be the explanation.

Reexamination of the Principal Component analysis

And how about the 'prophetic voice' by itself? Let me regroup things on Holmes's two principal figures:

I tried regrouping the authors a couple of different ways. The blue ovals preserve Holmes's groupings. You'll notice again that the 'prophetic voice' grouping is much larger than the other two. So I copied the Isaiah oval. In the upper graph these copies are in purple. Three authors with signals as varied as Isaiah's don't even cover all of the Mormon Scripture samples. But maybe the Isaiah oval is unusually small. So I used the Joseph Smith oval. You can see the copies in red. With these circles, Nephi can be recognized as a single author, and most of the Mormon scripture selections can be covered with just two circles and a few outliers. Problem is, Holmes hasn't provided us with any objective way to deal with outliers. Do other authors have outliers? How far out? What causes them? It's not like me making a measurement error with one of my experiments. You can't just statistically remove 10000 words of text out of existence. Those words must belong to some author. They weren't written by mistake.

Maybe Joseph Smith is still too small a circle to represent the style of a typical author. I drew green dashed circles around the selections from 'Mormon'. This circle includes all of Mormon scripture--except one Doctrine and Covenants sample. Also, it includes two Isaiah samples. And a circle that large is unable to distinguish between Joseph Smith and Isaiah. To my mind, the justification for grouping all of Mormon scripture together is getting more dubious. I made a little more quantitative comparison:

I measured the distances between the farthest samples of single authors on Holmes's plots. I normalized them all to the Isaiah distance. By the Isaiah distance, we have 6-10 authors in Mormon scripture, and two and a half authors in Joseph Smith's personal writings. Using Joseph Smith's value of 2.3 as a guide, we have at least 2-3 other authors in Mormon scripture, and that doesn't count any authors that might have had similar vocabulary richness. That does happen, by the way. M1 and D1 might still be outliers, but there could be explanations for those, like differences in genre and age of the author, or greater or lesser involvement of particular scribes. Other than those exceptions, all the other authors fall pretty comfortably within single-author-sized distances. Since Holmes was pushing the conclusion that this huge group was all one stylometric signal, he didn't provide any real discussion of the outliers, and we can't explain them without duplicating his work. I personally don't see the point in that.
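The normalization described above can be sketched in a few lines of Python. The coordinates below are hypothetical, not read off Holmes's plots: they are chosen only so that the Joseph Smith spread comes out at the 2.3 quoted above and the pooled group lands inside the 6-10 range. The names `max_spread` and `samples` are likewise invented for the illustration.

```python
import math
from itertools import combinations


def max_spread(points):
    """Largest pairwise distance among one group's samples on a 2-D plot."""
    return max(math.dist(a, b) for a, b in combinations(points, 2))


# Hypothetical plot coordinates (made up; NOT measured from Holmes's figures).
samples = {
    "Isaiah": [(0.0, 0.0), (1.0, 0.0), (0.5, 0.5)],
    "Joseph Smith": [(5.0, 5.0), (7.3, 5.0), (6.0, 5.5)],
    "Pooled scripture group": [(0.0, 0.0), (8.0, 0.0), (4.0, 1.0)],
}

baseline = max_spread(samples["Isaiah"])  # Isaiah's spread is the unit of measure
relative = {name: max_spread(pts) / baseline for name, pts in samples.items()}

for name, ratio in relative.items():
    print(f"{name}: {ratio:.1f} Isaiah-widths")
```

A ratio well above 1 means the group is spread out more widely than the reference author, which is the sense in which a group can be said to hold 'several authors' worth' of variation.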

Sichel Parameters

I further spent some time looking up Holmes's references and a few papers that reference his work. I found the work of Sichel particularly interesting.

Sichel uses alpha and theta to take a whole set of articles written by an 'unknown' number of authors and then decide how many authors wrote 1 article, 2 articles, 3 articles, etc. These parameters identify the total numbers of authors who authored a given number of articles quite accurately, and without externally biasing the author selection. What I mean is, Sichel didn't say, "There are 300 authors, figure out how many articles each one wrote." He just said, "There are 600 articles. How many authors wrote articles, and how many did each write?" This doesn't imply that Sichel could correctly identify the authors with 100% accuracy (or any accuracy, necessarily), but he could get the number of authors right on a lot of different data sets. Look at how well he did with a couple of representative data sets in the figure.
I don't claim to understand everything that went into this analysis, but all of the data sets look very similar to these two. If 203 authors each wrote 1 article, Sichel didn't predict that 250 or 170 authors wrote one article. He predicted 206.7, or 203, or 200. He got really close, within 1-2%. Why didn't Holmes finish the job and do Sichel's prediction of the total number of authors in Holmes's data set? It is an obvious and natural extension of the paper he references, and should have been possible with the data collected. What would we expect from Holmes's analysis if his conclusions are correct? Here's a table of my own:


| Number of Selections | Number of Authors: Observed | Number of Authors: Holmes Conclusion | Number of Authors: Expected (Sichel Method) | Names of Observed Authors | Names of Authors Proposed by Holmes |
| 1 | 3 | 0 | ? | Lehi, Jacob, Abraham | |
| 2 | 2 | 0 | ? | Alma, Moroni | |
| 3 | 4 | 2 | ? | Isaiah, Joseph Smith, Nephi, Doctrine and Covenants | Isaiah, Joseph Smith |
| 4 | 0 | 0 | ? | | |
| 5 | 1 | 0 | ? | Mormon | |
| 18 | 0 | 1 | ? | | Prophetic Voice |

Holmes concluded that Joseph Smith wrote three selections (J1-3), Isaiah wrote three selections (I1-3), and Joseph Smith wrote the other 18 in his 'prophetic voice'. Is that what the Sichel parameters actually predicted? Or would they have predicted 3, or 4, or 10 additional authors, as is qualitatively suggested by Holmes's principal component data? I don't really believe this prediction would be very informative, except to decide if Joseph Smith's revelations were ONE prophetic voice or LOTS of voices. Since this appears to have been a key purpose of Holmes's study, according to Holmes himself, I fault him seriously for this oversight.
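As a quick consistency check on the table, the two filled-in columns can be tallied in a short Python sketch. The counts are copied from the table; the dictionary and function names are invented for the illustration. The "Expected (Sichel Method)" column is left out because the table gives only question marks for it.

```python
# {number of selections per author: number of authors with that many selections}
observed = {1: 3, 2: 2, 3: 4, 5: 1}  # "Observed" column of the table
holmes = {3: 2, 18: 1}               # "Holmes Conclusion" column of the table


def total_authors(freq):
    """Number of distinct authors implied by a column."""
    return sum(freq.values())


def total_selections(freq):
    """Number of text selections accounted for by a column."""
    return sum(k * n for k, n in freq.items())


for label, freq in (("Observed", observed), ("Holmes", holmes)):
    print(label, total_authors(freq), "authors,", total_selections(freq), "selections")
```

Both columns account for the same 24 selections, but they divide them among 10 nominal authors in one case and 3 in the other. That gap is what a Sichel-style prediction of the number of authors would have had to choose between.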

Conclusions
Holmes's results confirm, in broad strokes, some principal conclusions of the original paper on Book of Mormon stylometry, namely, that the Book was written by several authors (at least 3), that none of these authors' styles match the style of the personal writings of Joseph Smith, and that the authors have measurably different styles than Isaiah, a known major contributor to the Book of Mormon. Instead of criticizing Holmes's paper, apologists should be embracing it. Yes, his conclusions aren't supported by his data, and his introduction shows at best a weak (and at worst an ideologically biased) understanding of Mormon history, but his data are great, as far as they go. In attempting to show that all Mormon scripture was just the invention of Joseph Smith, Holmes appears to have shown that Joseph Smith dictated with at least four measurably different stylometric signatures. Quite a feat.

Links

Holmes's original paper
Sichel's 1985 paper: I can provide a pdf of this paper upon request.
