Thursday, February 20, 2014

Vocabulary Richness and Joseph Smith's Writings

A Stylometric Analysis of Mormon Scripture and Related Texts
D. I. Holmes

Summary:

Holmes's data best support the following observations:
  1. Joseph Smith's personal writings are in a style completely and objectively different from all of the scriptural works he produced. 
  2. Isaiah is also identifiably different from these works. 
  3. The scriptural works produced by Joseph Smith show evidence of at least 3-4 different authorial styles.
  4. Holmes's conclusions regarding Joseph Smith's 'prophetic voice' are poorly supported by his data.
  5. OR Holmes did a poor job of applying his methods (I don't believe this).
  6. OR Holmes's methods and results are not informative (I don't believe this either, although it is approximately the claim made in the JBMRS reviews).

Data Selection

I was prepared to find Holmes's study to be really poorly done, after reading the reviews in JBMRS. I was pleasantly surprised to find that his data presentation is really clear and professional. I think it is almost essential for a believing Latter-day Saint to skip Holmes's introduction and conclusions if he is to have any hope of examining the data somewhat objectively. Consequently, I'm going to do that. The background on methods is worth noting, but it requires a fairly serious understanding of statistics to make complete sense of it. So I'm going to skip that, too, for the most part. That brings us all the way to what Holmes is studying and what results he presents from that study. I'll begin with Holmes's list of selected texts:
You can see from this that Holmes selected authors who are recognizable to a Mormon audience. He also included three selections from Isaiah, three from the personal writings of Joseph Smith, three from the Doctrine and Covenants, and one from the Book of Abraham. I say 'selections from', but the selections really include the entirety of most of these collections, split up into approximately 10000-word chunks. Right off, one might wonder if some of these shouldn't show up as single-author signals. When Nephi reported Lehi's words, did he preserve Lehi's style? Who wrote down Alma's words, and did Mormon influence the style? It's pretty clear to begin with that author attribution is going to be difficult, but let's see what happens when Holmes applies his relatively objective, statistical tools to these texts. Then we can see if any of our questions are answered.

Dendrogram Grouping

Here is the first figure:
I honestly don't know how to interpret this figure. It makes a number of groupings, and accurately separates Joseph Smith and Isaiah from the rest of the selections, but with differences of only a few percent. I don't know what to do with it.

Principal Component Analysis of 5 partially independent vocabulary richness variables

Moving on, we get the most important figures of the paper:
These figures are where Holmes takes all of his various results and displays them in an informative, graphical fashion. You see, Holmes has 5 variables measuring vocabulary richness. Some of them give pretty much the same information, and others give different information. By statistically combining them, Holmes is able to separate out important vocabulary richness signals that he calls 'Principal Components' (for mathematical reasons I mostly understand, but that aren't necessary for our evaluation). One thing I should note is that the top graph provides more statistically significant information than the bottom graph, accounting for approximately 77% of the variation among selections. As far as I can judge, there is no reason to doubt Holmes's statistical abilities or the quality of his data. There is further no reason to doubt his professional credentials. So I trust that the data are real. In fact, the BYU statisticians who have criticized his work haven't criticized his results, but only his choice of methods and his conclusions. So I'm going to believe the data, completely. Now what do we see in the data?

Holmes leads the reader to the conclusions I have circled in the figure. Joseph Smith makes one clear group. Isaiah makes another clear group. Everything else makes a third clear group. QED. No need to go any further. All these writings represent 3 authors. Now I'll insert a little of Holmes's historical analysis:
"The overwhelming evidence, therefore, suggests that the Book of Abraham was a product of the mind of Joseph Smith."
The only thing that remains is to conclude that Joseph Smith has a distinct prophetic voice that shows up in all of his 'revelatory' writings.

I do have a few remaining questions. Do any other known authors show two extremely distinct voices in their writings? I've heard they don't, even when they are pretending to write as two different narrators. What does Holmes know about this? He doesn't tell us, anywhere. I know that over time an author's signal can change, but the Book of Mormon, the Doctrine and Covenants, and the Book of Abraham were written over an extended time span, as were Joseph Smith's personal writings, so it seems hard to think that time could be the explanation.
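
To make 'vocabulary richness' a little more concrete, here is a minimal sketch in Python of three textbook measures: the type-token ratio, Yule's K, and Honore's R. I am assuming these only as familiar examples of the genre. I am not claiming they are the exact five variables Holmes used, and the function name and tokenizing rule are my own inventions for illustration.

```python
import math
import re
from collections import Counter


def richness_measures(text):
    """Return a few classic vocabulary richness statistics for one text sample.

    Illustrative only: the tokenizer (lowercase runs of letters and
    apostrophes) is an assumption, not Holmes's preprocessing.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    n_tokens = len(tokens)            # N: total number of words
    counts = Counter(tokens)
    n_types = len(counts)             # V: number of distinct words
    # spectrum[i] = number of distinct words that occur exactly i times
    spectrum = Counter(counts.values())
    hapax = spectrum.get(1, 0)        # V1: words that occur exactly once

    type_token_ratio = n_types / n_tokens
    # Yule's K: a larger K means more repetition (a less rich vocabulary).
    yules_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - n_tokens) / n_tokens ** 2
    # Honore's R is undefined when every distinct word occurs once (V1 == V).
    if hapax < n_types:
        honore_r = 100 * math.log(n_tokens) / (1 - hapax / n_types)
    else:
        honore_r = float("inf")
    return {"N": n_tokens, "V": n_types, "V1": hapax,
            "TTR": type_token_ratio, "K": yules_k, "R": honore_r}


if __name__ == "__main__":
    # A toy sentence; real studies use samples of thousands of words.
    print(richness_measures("the cat and the dog and the bird"))
```

Each text sample gets one number per measure, and it is a table of such numbers (one row per sample) that a principal component analysis then compresses into a couple of plottable axes.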

Reexamination of the Principal Component Analysis

And how about the 'prophetic voice' by itself? Let me regroup things on Holmes's two principal figures:

I tried regrouping the authors a couple of different ways. The blue ovals preserve Holmes's groupings. You'll notice again that the 'prophetic voice' grouping is much larger than the other two. So I copied the Isaiah oval. In the upper graph these copies are in purple. Three authors with signals as varied as Isaiah's don't even cover all of the Mormon Scripture samples. But maybe the Isaiah oval is unusually small. So I used the Joseph Smith oval. You can see the copies in red. With these circles, Nephi can be recognized as a single author, and most of the Mormon scripture selections can be covered with just two circles and a few outliers. Problem is, Holmes hasn't provided us with any objective way to deal with outliers. Do other authors have outliers? How far out? What causes them? It's not like me making a measurement error with one of my experiments. You can't just statistically remove 10000 words of text out of existence. Those words must belong to some author. They weren't written by mistake.

Maybe Joseph Smith is still too small a circle to represent the style of a typical author. I drew green dashed circles around the selections from 'Mormon'. This circle includes all of Mormon scripture--except one Doctrine and Covenants sample. Also, it includes two Isaiah samples. And a circle that large is unable to distinguish between Joseph Smith and Isaiah. To my mind, the justification for grouping all of Mormon scripture together is getting more dubious. I made a little more quantitative comparison:

I measured the distances between the farthest samples of single authors on Holmes's plots. I normalized them all to the Isaiah distance. By the Isaiah distance, we have 6-10 authors in Mormon scripture, and two and a half authors in Joseph Smith's personal writings. Using Joseph Smith's value of 2.3 as a guide, we have at least 2-3 other authors in Mormon scripture, and that doesn't count any authors that might have had similar vocabulary richnesses. That does happen, by the way. M1 and D1 might still be outliers, but there could be explanations for those, like differences in genre and age of the author, or greater or lesser involvement of particular scribes. Other than those exceptions, all the other authors fall pretty comfortably within single-author-sized distances. Since Holmes was pushing the conclusion that this huge group was all one stylometric signal, he didn't provide any real discussion of the outliers, and we can't explain them without duplicating his work. I personally don't see the point in that.
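
For anyone who wants to repeat this kind of measurement, the arithmetic can be sketched in a few lines of Python. The coordinates below are hypothetical numbers I made up so the example runs; they are not values read from Holmes's plots, and the function names are mine.

```python
import math
from itertools import combinations

# HYPOTHETICAL (PC1, PC2) coordinates, invented for illustration.
# They are NOT values read from Holmes's figures.
example_samples = {
    "Isaiah": [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)],
    "Author A": [(0.0, 0.0), (6.0, 8.0), (1.0, 1.0)],
}


def max_spread(points):
    """Largest straight-line distance between any two samples of one author.

    Needs at least two samples; a single point has no spread to measure.
    """
    return max(math.dist(p, q) for p, q in combinations(points, 2))


def normalized_spreads(samples, reference="Isaiah"):
    """Express every author's spread as a multiple of the reference spread."""
    ref = max_spread(samples[reference])
    return {author: max_spread(points) / ref for author, points in samples.items()}


if __name__ == "__main__":
    for author, ratio in normalized_spreads(example_samples).items():
        print(f"{author}: {ratio:.2f}")
```

A ratio near 1 means the group is about as spread out as the reference author. A ratio of several means the group would need several reference-sized circles to cover it, which is the sense in which I am counting 'authors' here.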

Sichel Parameters

I further spent some time looking up Holmes's references and a few papers that reference his work. I found the work of Sichel particularly interesting.

Sichel uses alpha and theta to take a whole set of articles written by an 'unknown' number of authors and then decide how many authors wrote 1 article, 2 articles, 3 articles, etc. These parameters successfully identify the total numbers of authors who authored a given number of articles quite accurately, and without externally biasing the author selection. What I mean is, Sichel didn't say, "there are 300 authors, figure out how many articles each one wrote." He just said, "There are 600 articles. How many authors wrote articles, and how many did each write?" This doesn't imply that Sichel could correctly identify the authors with 100% accuracy (or any accuracy, necessarily), but he could get the number of authors right on a lot of different data sets. Look at how well he did with a couple of representative data sets in the figure.
I don't claim to understand everything that went into this analysis, but all of the data sets look very similar to these two. If 203 authors each wrote 1 article, Sichel didn't predict that 250 or 170 authors wrote one article. He predicted 206.7, or 203, or 200. He got really close, within 1-2%. Why didn't Holmes finish the job and do Sichel's prediction of the total number of authors in Holmes's data set? It is an obvious and natural extension of the paper he references, and should have been possible with the data collected. What would we expect from Holmes's analysis if his conclusions are correct? Here's a table of my own:


Number of Selections | Number of Authors: Observed | Number of Authors: Holmes Conclusion | Number of Authors: Expected (Sichel Method) | Names of Observed Authors | Names of Authors Proposed by Holmes
--- | --- | --- | --- | --- | ---
1 | 3 | 0 | ? | Lehi, Jacob, Abraham |
2 | 2 | 0 | ? | Alma, Moroni |
3 | 4 | 2 | ? | Isaiah, Joseph Smith, Nephi, Doctrine and Covenants | Isaiah, Joseph Smith
4 | 0 | 0 | ? | |
5 | 1 | 0 | ? | Mormon |
18 | 0 | 1 | ? | | Prophetic Voice

Holmes concluded that Joseph Smith wrote three selections (J1-3), Isaiah wrote three selections (I1-3), and Joseph Smith wrote the other 18 in his 'prophetic voice'. Is that what the Sichel parameters actually predicted? Or would they have predicted 3, or 4, or 10 additional authors, as is qualitatively suggested by Holmes's principal component data? I don't really believe this prediction would be very informative, except to decide if Joseph Smith's revelations were ONE prophetic voice or LOTS of voices. Since this appears to have been a key purpose of Holmes's study, according to Holmes himself, I fault him seriously for this oversight.
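
As a bookkeeping check on the 'Observed' and 'Holmes Conclusion' columns of my table, here is a short Python sketch that rebuilds them from the selection counts per author. It only tallies; it does not attempt Sichel's model, so the 'Expected' column stays a question mark. The variable and function names are my own.

```python
from collections import Counter

# Selections per named author, as listed in my table (24 selections in total).
observed = {
    "Lehi": 1, "Jacob": 1, "Abraham": 1,
    "Alma": 2, "Moroni": 2,
    "Isaiah": 3, "Joseph Smith": 3, "Nephi": 3, "Doctrine and Covenants": 3,
    "Mormon": 5,
}

# Holmes's grouping: two known authors plus one 18-selection 'prophetic voice'.
holmes = {"Isaiah": 3, "Joseph Smith": 3, "Prophetic Voice": 18}


def authors_per_selection_count(attribution):
    """Map 'number of selections' to 'number of authors credited with that many'."""
    return dict(sorted(Counter(attribution.values()).items()))


if __name__ == "__main__":
    print("Observed:", authors_per_selection_count(observed))
    print("Holmes:  ", authors_per_selection_count(holmes))
    # Both attributions must account for the same set of selections.
    print("Totals:  ", sum(observed.values()), sum(holmes.values()))
```

This is the same kind of 'how many authors wrote r items' tally that Sichel's distribution is fitted to, which is why I think the comparison would have been a natural one for Holmes to make.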

Conclusions
Holmes's results confirm, in broad strokes, some principal conclusions of the original paper on Book of Mormon stylometry, namely, that the Book was written by several authors (at least 3), that none of these authors' styles match the style of the personal writings of Joseph Smith, and that the authors have measurably different styles than Isaiah--a known major contributor to the Book of Mormon. Instead of criticizing Holmes's paper, apologists should be embracing it. Yes, his conclusions aren't supported by his data, and his introduction shows at best a weak (and at worst an ideologically biased) understanding of Mormon history, but his data are great--as far as they go. In attempting to show that all Mormon scripture was just the invention of Joseph Smith, Holmes appears to have shown that Joseph Smith dictated with at least four measurably different stylometric signatures. Quite a feat.

Links

Holmes's original paper
Sichel's 1985 paper: I can provide a pdf of this paper upon request.

Friday, February 14, 2014

Stylometric Analysis of Mormon Scripture

About 15 years ago John Hilton spoke to my senior religion seminar for science majors about his statistical analysis of Book of Mormon authorship. It was exciting to see the non-contextual word methods explained and to see the graphs going up showing that Joseph Smith, Oliver Cowdery, Sidney Rigdon, and others had different stylistic signatures from Nephi and Alma. It was exciting that this could be shown with objective methods that didn't rely on the highly subjective types of authorship analysis typically used in sorting out biblical authorship questions. He showed how the same methods were used to identify authors of the Federalist Papers, and how the results were very convincing compared to the results by other statistical authorship attribution methods. In particular, he criticized the vocabulary richness methods used by David Holmes in attributing Book of Mormon authorship entirely to Joseph Smith. He showed that Holmes's methods were unable to distinguish among known authors on things like the Federalist Papers, and mentioned that Holmes had moved on in other projects to use the non-contextual word analysis like that used by Hilton and his colleagues. Hilton showed us how non-contextual word methods could distinguish between an author's own writings and that same author's translations of other works. We saw convincing, multidimensional graphs showing that the Doctrine and Covenants had a different signal from Joseph Smith's and Oliver Cowdery's personal writings, suggesting a different revelatory voice for Joseph Smith (and still different from the Book of Mormon).

Hilton also explained to us a number of pitfalls in stylometric studies (statistical analyses of word use). Apparently it is well documented that changes in genre can drastically change word use. For example, when we speak we use a smaller vocabulary than when we write, and we also use non-contextual words at significantly different rates. So it is very important to compare similar genres when doing stylometric analyses.

I basically fell in love with Hilton's work and took away from it a disdain for Holmes's study. In 2008 another stylometric analysis of the Book of Mormon came out. I found out about it roughly a year ago. It wasn't easily accessible, but a couple of reviews and some explanations of new stylometric analyses were published in the Journal of the Book of Mormon and Restoration Scripture, so I read those. Those articles pointed out what seemed like obvious, fatal flaws due to the hypothesis put forward in the 2008 study, so I never got very interested in actually looking at the original. Then I got more involved with internet Mormonism and a couple of things happened. I discovered that a number of thoughtful Mormons are automatically suspicious of anything that comes out of the Neal A. Maxwell Institute. I also had a chance encounter with Craig Criddle, the primary investigator on the 2008 study. He and I didn't hit it off (he was intent on pushing a Spaulding/Rigdon hypothesis for Book of Mormon authorship, however tenuous the data may be, and I am totally confident in the, at least primarily, ancient origins of the Book of Mormon), but I was able to listen a little and get a little better perspective on the work he was involved with.

Now I come to why I'm writing this series of posts. Few of my internet Mormon friends find Hilton's work as convincing as I do. I think an interpretation of Hilton's work limited to his strongest conclusions is quite compelling. The minimal message is that Nephi is not Alma is not Anybody Modern who was involved with the Book of Mormon according to any (even weak) historical evidence. His data don't make any claims about who Nephi and Alma were. They don't make any claims about when Nephi and Alma lived. They don't make any claims about the moral authority of Joseph Smith or the Book of Mormon. Yet for me this is the most objective evidence available regarding the origins of the Book of Mormon. It is reproducible. It uses methods exactly as they have been applied to answer equivalent questions in peer-reviewed literature. It explains its controls and limitations. It doesn't go beyond the best data. I know this because Hilton talked with us about some other, more tentative results. The method suggests several more authors exist in the Book of Mormon, but Hilton's confidence in those results was lower, either because of shifts in genre or because they did not quite reach the 95% confidence chosen by statisticians as a cutoff. Because of this I have felt no qualms about claiming the Book of Mormon contains at least two authors who were not the proposed 19th-century authors. I've stated that any critic of Book of Mormon antiquity needs to deal with this objective fact, and I don't think any have.

Recently, I have been asked by a couple of people to help them understand this assertion. I decided to get all of the original papers and try to understand them myself, as I would a chemistry paper. Very often a chemist does not have the specific expertise to critique all of the methods and assertions in a paper. We rely on a history of expertise, logical presentation of the material, thorough citation of relevant papers on the subject, and our own skills in data analysis. We look at professional credentials of the authors and knowledge of earlier uses of the methods employed. Thorough citations show that the authors have a command of the subject and have given due consideration to previous work. Then I ask if I know how to look at a graph and interpret what is shown. In the posts that follow, I will show how a chemist trained in data analysis interprets the work of linguists and statisticians. Hopefully by being up front with my known biases and by showing you the data presented in the original papers, I can help you become more comfortable with what stylometric analyses of the Book of Mormon have and have not demonstrated. I think there is an excellent analysis of the various studies written by Matthew Roper, Paul Fields, and Bruce Schaalje, but I am going to take a different approach.

I intend to only look for the clearest, strongest results from each of the stylometric studies, and to see if there is any way to integrate these results into a coherent, non-contradictory whole. Where it is not possible, I hope to explain my reasons for choosing one result over another. I will also add some personal analyses and questions as I go. If you are interested, be prepared to look at a lot of graphs and numbers. I'm considering length no object, but I will try to summarize my conclusions at the beginning and end of each post. I will not be discussing historical evidence of Book of Mormon authorship. The vast majority (and maybe all) of first and second hand, contemporary evidence is that Joseph Smith dictated the vast majority of the book, without reference to any other texts, to Oliver Cowdery, over a period of a couple of months. Everything else is, to my mind, speculation and invention. That doesn't imply that the speculations are false, only highly subjective. My conclusions will not rely on any claims about how the words got into Joseph Smith's head before coming out of his mouth.

Here is a list of the papers I'll be working through, not in any particular order: 

A Stylometric Analysis of Mormon Scripture and Related Texts
D. I. Holmes
http://www.jstor.org/sici?sici=0964-1998%281992%29155%3A1%3C91%3AASAOMS%3E2.0.CO%3B2-Z


Stylometric Analyses of the Book of Mormon: A Short History
Matthew Roper, Paul J. Fields, and G. Bruce Schaalje
http://publications.maxwellinstitute.byu.edu/fullscreen/?pub=1380&index=3

Examining a Misapplication of Nearest Shrunken Centroid Classification to Investigate Book of Mormon Authorship
Reviewed by Paul J. Fields, G. Bruce Schaalje, and Matthew Roper
http://publications.maxwellinstitute.byu.edu/fullscreen/?pub=1462&index=7

On Verifying Wordprint Studies: Book of Mormon Authorship
John L. Hilton
http://publications.maxwellinstitute.byu.edu/fullscreen/?pub=1099&index=12

Who Wrote the Book of Mormon? An Analysis of Wordprints
Wayne A. Larsen and Alvin C. Rencher
http://publications.maxwellinstitute.byu.edu/fullscreen/?pub=1130&index=10
https://byustudies.byu.edu/showTitle.aspx?title=5424


http://www.matthewjockers.net/publications/ (Preprints available)
Jockers, Matthew L. “Testing Authorship in the Personal Writings of Joseph Smith Using NSC Classification.” Literary and Linguistic Computing. 28.3, (2013): 371-381
Jockers, Matthew L., Daniela M. Witten, and Craig S. Criddle. “Reassessing Authorship of the Book of Mormon Using Delta and Nearest Shrunken Centroid Classification.” Literary and Linguistic Computing, 23.4 (2008): 465 – 492.
http://www.jstor.org/stable/2982671

The following are not freely available online. I have (or am acquiring) personal copies, which I may be able to share for personal use. You may also be able to access them through a university library.

Extended nearest shrunken centroid classification: A new method for open-set authorship attribution of texts of varying sizes
G. Bruce Schaalje and Paul J. Fields
http://llc.oxfordjournals.org/content/early/2011/01/18/llc.fqq029.abstract (not free)

Open-Set Nearest Shrunken Centroid Classification
G. Bruce Schaalje and Paul J. Fields
http://www.tandfonline.com/doi/full/10.1080/03610926.2010.529529#.Uv00UbRni2o