
College of Arts & Sciences
Department of Statistics

Roger Bilisoly Colloquium

Thursday, November 5, 2015 - 2:45pm

Statistics Department Colloquium

Where: Close-Hipp Building, Room 364

Speaker: Roger Bilisoly

Affiliation: Central Connecticut State University, Department of Mathematical Sciences

Title: Generalizing Distance and Moments for Text Mining

Abstract: In general, a text is a sequence of symbols, which can be thought of as a categorical time series. Unlike numerical data, it is not obvious how to compute the mean or the variance of a set of symbols, but if this could be done, text analogs of traditional statistical techniques could be constructed. It turns out that the first two moments can be defined in terms of optimization: the minimum value of $\frac{1}{n} \sum_{i=1}^{n} (x_i - c)^2$ over $c$ is the variance of the $x_i$, and the value of $c$ at that minimum is the mean. Noting that $(x_i - c)^2$ is the squared Euclidean distance on the real line, this suggests the possibility of generalization by minimizing $\frac{1}{n} \sum_{i=1}^{n} \text{distance}(x_i, c)^2$ for non-Euclidean distances. This talk applies this approach to the Levenshtein edit distance, which is defined for pairs of strings. To illustrate these ideas we consider the variability of William Langland’s Middle English alliterative poem, Piers Plowman, a problem studied in detail by philologist and medievalist Walter W. Skeat in the second half of the 19th century.
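The optimization described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: the function names `levenshtein` and `string_mean_and_variance` are hypothetical, the edit distance uses unit costs, and the candidate center $c$ is restricted to the observed strings themselves, whereas the talk may minimize over a larger set of strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def string_mean_and_variance(strings):
    """Return (c, v): c minimizes the mean squared edit distance to the
    sample, and v is that minimum value.

    Assumption: c is searched only among the sample strings. Ties are
    broken in favor of the earliest string in the list.
    """
    best_c, best_value = None, float("inf")
    for c in strings:
        value = sum(levenshtein(x, c) ** 2 for x in strings) / len(strings)
        if value < best_value:
            best_c, best_value = c, value
    return best_c, best_value


# Made-up sample for illustration only.
center, spread = string_mean_and_variance(["cat", "cat", "cot", "coat"])
print(center, spread)
```

Replacing `levenshtein` with the absolute difference of two numbers gives the same objective as the classical one, restricted to sample points, so the sketch mirrors the numerical case in the abstract.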

Full abstract for Roger Bilisoly presentation