Interactive Java Tools For Exploring High Dimensional Data

by

James W. Bradley and R. Webster West

 

 

1. Introduction

 

The World Wide Web (WWW) is a new mechanism for providing information. At this point, the majority of the information on the WWW is static, meaning that it cannot respond to user input. Text, images, and video are examples of static information that can easily be included in a WWW page. With the advent of the Java programming language, it is now possible to embed dynamic information in the form of interactive programs called applets. It is therefore possible not only to transfer raw data over the WWW but also to provide interactive graphics for displaying and exploring data in the context of a WWW page. In this paper, we describe Java applets that have been developed for the interactive display of high dimensional data on the WWW.

 

1.1 The Java Environment

Java is both a high-level programming language and a software platform with many useful characteristics. It is object-oriented, portable, interpreted, and multithreaded. A compiler translates a Java program into an intermediate, platform-independent form called bytecode. The bytecode is then executed by a Java interpreter, such as the one built into a web browser. Compilation therefore happens only once, whereas interpretation occurs every time the program is executed. This makes it possible to run Java programs almost anywhere, but interpreted code can run more slowly than natively compiled code. Java programs will run consistently on platforms such as Windows 95, Windows NT, Solaris, and Macintosh. For more information on Java, refer to the Java Developer's Guide (Jaworski 1996).

Java applets are similar to applications, but they do not stand alone. They must run within the context of a Java-capable WWW browser. The most popular browsers, Netscape Navigator and Internet Explorer, are Java-capable. When a WWW page containing an applet is displayed, the applet is loaded and executed, and the applet's output is displayed within a portion of the browser's display area. Because applets are executed locally, interaction with the applet can take place efficiently. Jaworski (1996) provides a nice introduction to applet programming.

 

1.2 Graphical Displays For High Dimensional Data

Suppose we have n observations of p-dimensional data. Each observation is a point in p-dimensional space. Unless p is at most two, it is not possible to project a p-dimensional set of points onto a two-dimensional plane without losing some information from the data. Techniques for visualizing and exploring multivariate data, especially in a higher dimensional space, are severely limited by the two-dimensional display medium of a computer. Cleveland and McGill (1988) is a good general reference on statistical graphics and the graphical tools available for high dimensional data.

Chernoff (1973) created one of the first graphical techniques for high dimensional data exploration. He proposed mapping the variables in a data set to features of faces as a method of visualizing multidimensional data. The pros and cons of icon-based methods are discussed further in Asimov (1985) and Chomut (1987). One drawback of these techniques is the difficulty of capturing geometric patterns in a single representation and of accurately portraying hypergeometrical structure (Wegman 1990).

More recently developed tools for visualizing high dimensional data include projection pursuit methods and data imaging. Cook et al. (1993) discussed the projection pursuit procedure, which displays a succession of graphs chosen by an algorithm designed to look for interesting structure among different variable combinations. Minnotte and West (1999) developed the data image as a graphical tool which represents variables on the vertical axis and observations on the horizontal axis. Observations are color coded relative to the minimum and maximum for each variable. The observations are then rearranged so that observations that are close together in p-dimensional space are also close together in the linear ordering represented in the data image. Bands that appear as vertical stripes in the image represent clusters.

In terms of more basic tools, a pairs plot is a matrix of all two-dimensional scatter plots for a data set. This plot is a standard tool for high dimensional data and is very useful for looking at pairwise relationships, exploring collinearity, and getting an overall view of the data set. A three-dimensional scatter plot is also useful for assessing the relationship among three of the variables. This plot is a simple form of "spatialization" in which non-spatial data "dimensions" are mapped to the two dimensions of a display space. In a static pairs plot, there is no way to tell where a given observation falls in each of the scatter plots. An interactive pairs plot allows the user to identify particular points and clusters across all of the panels. Likewise, the ability to rotate a three-dimensional scatter plot allows the user to better assess the relationships among the three variables. Thus, for each plot, interactivity is the key feature for providing a more useful tool for data exploration.

Figure 1.1 displays the output from the Splus brush function, which creates an interactive combination of the pairs plot and the three-dimensional scatter plot. The data shown are from the Splus (1998) states data set. This data set contains the name, population estimate, per capita income, illiteracy rate, life expectancy, murder rate, high school graduation rate, mean number of days with minimum temperature below thirty-two degrees, and land area for each of the fifty U.S. states.

 

Figure 1.1

 

A static page cannot do justice to this plot. Points in any of the individual plots within the graphics window may be highlighted interactively. This technique is referred to as brushing. The mouse is used to move the brush over data points of interest. After the mouse button is pressed, the points are highlighted on all of the data displays. This plot is very useful for identifying the observations that belong to a cluster and for finding other high dimensional structure.

The parallel coordinate technique was originally proposed and implemented by Inselberg (1985) and task-specific variations of the device have been put forth by other statisticians (Wegman 1990; Miller and Wegman 1991; Jang and Yang 1996). In this plot, the variable axes are parallel instead of perpendicular as with typical scatter plots. Each observation in a data set is represented as an unbroken series of line segments. The value of a specific variable for each observation is plotted along each axis relative to the minimum and maximum values of the variable. The points are then connected using line segments. A parallel coordinate plot created in WebStat (West, 1998) is shown in Figure 1.2 for the states data set.

 

Figure 1.2

 

The primary advantage of the parallel coordinate plot over other types of statistical graphics is its ability to display multivariate data in one representation for a large number of variables. Observations with similar data values across all variables will share similar signatures. Thus, clusters of like observations can be seen. The elements of these clusters, however, cannot be identified without interactivity. Making this tool interactive would therefore allow the user to obtain more information about the data set. Correlation between adjacent variables can also be visualized. For instance, two negatively correlated variables on adjacent axes will be connected by line segments which cross repeatedly in the region between the axes. In the above plot, for example, it is easy to notice the negative association between murder rate and high school graduation. Wegman (1990) provides a good introduction to using parallel coordinates in data analysis.
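The construction of a parallel coordinate plot described in this section can be sketched in Java as follows. This is only an illustration under stated assumptions, not the code used in WebStat or in the applets discussed later: the class name ParallelCoordinateScaler, the evenly spaced vertical axes, the convention that the column minimum is drawn at the bottom of the axis, and the treatment of a constant column (drawn at mid-axis) are all hypothetical choices.

```java
// Hypothetical helper, not taken from the applets described in this paper.
final class ParallelCoordinateScaler {

    // Returns {min, max} of column j of data[n][p].
    static double[] columnRange(double[][] data, int j) {
        double min = data[0][j];
        double max = data[0][j];
        for (int i = 1; i < data.length; i++) {
            min = Math.min(min, data[i][j]);
            max = Math.max(max, data[i][j]);
        }
        return new double[] {min, max};
    }

    // Maps every observation to one vertex per variable.
    // vertices[i][j] = {x, y} in pixels. Axis j is placed at an evenly
    // spaced x position (assumption); y is height-1 for the column
    // minimum and 0 for the column maximum, since pixel rows grow downward.
    static int[][][] toPolylines(double[][] data, int width, int height) {
        int n = data.length;
        int p = data[0].length;
        int[][][] vertices = new int[n][p][2];
        for (int j = 0; j < p; j++) {
            double[] range = columnRange(data, j);
            double span = range[1] - range[0];
            int x = (p == 1) ? width / 2
                             : (int) Math.round(j * (width - 1) / (double) (p - 1));
            for (int i = 0; i < n; i++) {
                // Position relative to the minimum and maximum of variable j.
                double t = (span == 0.0) ? 0.5 : (data[i][j] - range[0]) / span;
                vertices[i][j][0] = x;
                vertices[i][j][1] = (int) Math.round((1.0 - t) * (height - 1));
            }
        }
        return vertices;
    }
}
```

Connecting vertices[i][0], vertices[i][1], and so on with line segments gives the unbroken signature of observation i. Because every axis is rescaled to its own range, variables measured in very different units can share one display.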

 

1.3 Graphical Displays In Java

We will develop programs for displaying high dimensional data in a WWW page by implementing some of the basic graphical tools mentioned above as Java applets. Each applet will support the interactive investigation of a high dimensional data set, allowing the user to discover interesting characteristics of the data within the context of a Web page.

Interactive versions of a pairs plot, a parallel coordinate plot, and a three-dimensional plot will be developed as Java applets. Also, a combination of these will be developed similar to the display shown in Figure 1.1. The development of the Java applets will be discussed in Section 2. Concluding statements will be made in Section 3.

 

2. Java Applets For Interactive Graphics

 

Applet versions of the interactive graphics discussed in the previous section will be developed in this section. A detailed description will be provided for the creation of four interactive applets: a pairs plot, a three-dimensional plot, a parallel coordinates plot, and a plot that combines the three previous plots. The applets may be viewed at http://www.stat.sc.edu/~west/bradley/. Basic Java components will be developed, and these components will serve as the building blocks for the applets described above. This approach emphasizes the use of object-oriented programming in Java.

 

2.1 An Interactive Scatter Plot

Object-oriented programming focuses on self-contained software components. In this section, we will create an interactive scatter plot component which will allow for highlighting and labeling specific observations. This component will then serve as the basic building block for an interactive pairs plot.

To create an interactive scatter plot, we extend the Java Canvas class, which defines a rectangular drawing area. We have chosen to name this new class Plotcanvas. In this class, we translate double data values into pixel values, which are then plotted on an offscreen image. This technique, known as double buffering, is necessary so that the graphics do not flicker during user interaction. For a complete description of double buffering, see http://java.sun.com. Mouse events make it possible to draw a rectangle on the scatter plot object, and the points within the rectangle are highlighted when the mouse button is released. Points are also identified in the status bar when the mouse is moved within three pixels of a point. Clicking the mouse will highlight the points identified in the status bar.
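The coordinate translation and point selection just described can be sketched as below. This is a minimal illustration and not the Plotcanvas source: the class name ScatterGeometry, its method names, and the use of Euclidean distance for the three-pixel rule are assumptions, and the data ranges are assumed to be non-degenerate. Drawing to the offscreen image is omitted so that only the geometry is shown.

```java
// Hypothetical helper class; names and the distance rule are assumptions.
final class ScatterGeometry {
    private final double xMin, xMax, yMin, yMax; // data ranges (assumed xMax > xMin, yMax > yMin)
    private final int width, height;             // drawing area in pixels

    ScatterGeometry(double xMin, double xMax, double yMin, double yMax,
                    int width, int height) {
        this.xMin = xMin; this.xMax = xMax;
        this.yMin = yMin; this.yMax = yMax;
        this.width = width; this.height = height;
    }

    // Data x value to pixel column: xMin maps to 0, xMax to width-1.
    int toPixelX(double x) {
        return (int) Math.round((x - xMin) / (xMax - xMin) * (width - 1));
    }

    // Data y value to pixel row: pixel rows grow downward, so yMax maps to 0.
    int toPixelY(double y) {
        return (int) Math.round((yMax - y) / (yMax - yMin) * (height - 1));
    }

    // Flags the points lying within `tolerance` pixels of the mouse position.
    boolean[] nearMouse(double[] xs, double[] ys, int mouseX, int mouseY, int tolerance) {
        boolean[] hit = new boolean[xs.length];
        for (int i = 0; i < xs.length; i++) {
            int dx = toPixelX(xs[i]) - mouseX;
            int dy = toPixelY(ys[i]) - mouseY;
            hit[i] = dx * dx + dy * dy <= tolerance * tolerance;
        }
        return hit;
    }

    // Flags the points inside the rectangle spanned by the mouse press and
    // release positions; the drag may go in any direction.
    boolean[] insideBrush(double[] xs, double[] ys,
                          int pressX, int pressY, int releaseX, int releaseY) {
        int left = Math.min(pressX, releaseX), right = Math.max(pressX, releaseX);
        int top = Math.min(pressY, releaseY), bottom = Math.max(pressY, releaseY);
        boolean[] inside = new boolean[xs.length];
        for (int i = 0; i < xs.length; i++) {
            int px = toPixelX(xs[i]);
            int py = toPixelY(ys[i]);
            inside[i] = px >= left && px <= right && py >= top && py <= bottom;
        }
        return inside;
    }
}
```

In a canvas subclass, nearMouse would typically be called from the mouse-move handler with a tolerance of 3, and insideBrush from the mouse-release handler, with the resulting flags used to choose the color of each point when the offscreen image is redrawn.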

ScatterPlot.class is an interactive scatter plot applet. The layout for this applet is fairly simple. A List, a TextArea and a Button will be added to the right of the Plotcanvas object. The Button is used to reset the applet by removing all highlighting. The List object is used to add a list of labels for the points in the plot. These labels are selected in the List object when points are to be highlighted in the plot, and they are added to the list of highlighted points in the TextArea.

 

Figure 2.1


 

The applet tag shown below was used to include Figure 2.1 in this document. Interested readers may include any of the applets discussed in this paper in their own WWW pages by modifying this tag. The name of the desired applet must be inserted in place of "ScatterPlot.class". The variables parameter lists the names of the variables to be included in the plot. Even though no variable names are displayed in Figure 2.1, this parameter is required because it defines the number of variables to be plotted. The data parameter supplies the actual data to be plotted. The pidentifiers parameter supplies a vector of labels, one for each row of observations. To ensure that these parameters are read properly by the applet, values must be space delimited even if a line return is used. (Some browsers ignore line returns in parameter statements.)

<P ALIGN="CENTER"><APPLET
CODEBASE="http://www.stat.sc.edu/~west/bradley/" CODE="ScatterPlot.class" WIDTH=500 HEIGHT=350>
<PARAM NAME="variables" VALUE="
Income 
HSGrad 
">
<PARAM NAME="data" VALUE="
3624 41.3 
6315 66.7 
4530 58.1 
3378 39.9 
5114 62.6 
4884 63.9 
5348 56.0 
4809 54.6 
4815 52.6 
4091 40.6 
4963 61.9 
4119 59.5 
5107 52.6 
4458 52.9 
4628 59.0 
4669 59.9 
3712 38.5 
3545 42.2 
3694 54.7 
5299 52.3 
4755 58.5 
4751 52.8 
4675 57.6 
3098 41.0 
4254 48.8 
4347 59.2 
4508 59.3 
5149 65.2 
4281 57.6 
5237 52.5 
3601 55.2 
4903 52.7 
3875 38.5 
5087 50.3 
4561 53.2 
3983 51.6 
4660 60.0 
4449 50.2 
4558 46.4 
3635 37.8 
4167 53.3 
3821 41.8 
4188 47.4 
4022 67.3 
3907 57.1 
4701 47.8 
4864 63.5 
3617 41.6 
4468 54.5 
4566 62.9 
">
<PARAM NAME="pidentifiers" VALUE="
Alabama 
Alaska 
Arizona 
Arkansas 
California 
Colorado 
Connecticut 
Delaware 
Florida 
Georgia 
Hawaii 
Idaho 
Illinois 
Indiana 
Iowa 
Kansas 
Kentucky 
Louisiana 
Maine 
Maryland 
Massachusetts 
Michigan 
Minnesota 
Mississippi 
Missouri 
Montana 
Nebraska 
Nevada 
NewHampshire 
NewJersey 
NewMexico 
NewYork 
NorthCarolina 
NorthDakota 
Ohio 
Oklahoma 
Oregon 
Pennsylvania 
RhodeIsland 
SouthCarolina 
SouthDakota 
Tennessee 
Texas 
Utah 
Vermont 
Virginia 
Washington 
WestVirginia 
Wisconsin 
Wyoming 
">
<B>ScatterPlot applet would be here if you had a Java-enabled
browser.</B></APPLET></P>
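The way space-delimited parameter values such as those in the tag above might be turned into arrays is sketched below. The class AppletParamParser and its methods are hypothetical and are not the applets' actual parsing code; the row-by-row ordering of the data values is an assumption based on the layout of the tag. In an applet, the strings would normally be obtained with getParameter("variables"), getParameter("data"), and getParameter("pidentifiers").

```java
// Hypothetical parser for the PARAM values; not the authors' implementation.
final class AppletParamParser {

    // Splits a parameter value on any run of whitespace, so spaces and
    // line returns are treated alike.
    static String[] tokens(String value) {
        String trimmed = value.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Builds an n-by-p matrix from the data parameter, assuming the values
    // are listed one observation after another (row-major order).
    static double[][] parseData(String dataValue, int variableCount) {
        String[] t = tokens(dataValue);
        if (variableCount <= 0 || t.length % variableCount != 0) {
            throw new IllegalArgumentException(
                "number of data values is not a multiple of the number of variables");
        }
        int n = t.length / variableCount;
        double[][] data = new double[n][variableCount];
        for (int k = 0; k < t.length; k++) {
            data[k / variableCount][k % variableCount] = Double.parseDouble(t[k]);
        }
        return data;
    }
}
```

With this kind of parsing, the number of tokens in the variables parameter fixes the number of columns, which is consistent with that parameter being required. It also suggests why multiword labels in the tag are written as single tokens such as NewHampshire: a label containing a space would be read as two labels.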

 

2.2 Interactive Pairs Plot

The Plotcanvas object created in section 2.1 can be easily used to create an interactive pairs plot. This applet will be called PairsPlot.class. The applet consists of a grid of Plotcanvases which are linked together so that they can communicate. The PairsPlot applet for the states data set is included in Figure 2.2.
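One common way to link a grid of plot components so that they can communicate is to have them share a single selection object and listen for changes to it. The sketch below illustrates that idea only; the paper does not state how the Plotcanvases are actually linked, and the names SelectionListener and SharedSelection, as well as the choice to accumulate highlights until a reset, are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback implemented by every linked plot.
interface SelectionListener {
    void selectionChanged(boolean[] highlighted);
}

// Hypothetical shared model: one highlight flag per observation.
final class SharedSelection {
    private final boolean[] highlighted;
    private final List<SelectionListener> listeners = new ArrayList<>();

    SharedSelection(int observationCount) {
        highlighted = new boolean[observationCount];
    }

    void addListener(SelectionListener listener) {
        listeners.add(listener);
    }

    // Adds newly brushed observations to the current highlight set
    // (assumption: highlights accumulate until reset) and tells every plot.
    void highlight(boolean[] picked) {
        for (int i = 0; i < highlighted.length; i++) {
            highlighted[i] = highlighted[i] || picked[i];
        }
        fire();
    }

    // Removes all highlighting, as a reset button would.
    void reset() {
        java.util.Arrays.fill(highlighted, false);
        fire();
    }

    boolean isHighlighted(int i) {
        return highlighted[i];
    }

    // Each listener receives its own copy so it cannot alter the shared state.
    private void fire() {
        for (SelectionListener listener : listeners) {
            listener.selectionChanged(highlighted.clone());
        }
    }
}
```

In a pairs plot built this way, each panel would register itself as a listener, call highlight when the user brushes points in that panel, and repaint in selectionChanged, so that brushing in one scatter plot is reflected in all of the others.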

 

Figure 2.2


 

 

2.3 Interactive Three Dimensional Plot

The three-dimensional plot applet will be called ThreeDplot.class. The layout will have a structure similar to that of the applets discussed above. This applet, however, will have two additional components. First, a new panel will contain a canvas used to spin the plot. This canvas, constructed with basic drawing techniques, shows a set of arrows representing all of the directions for spinning. The user may spin the plot by pressing the mouse over an arrow. When the mouse is pressed, the arrow will fill with color, and the plot will begin to spin. Upon release of the mouse, spinning will stop and the color will be removed. Second, the variables in the plot are changed through the Choice class, which implements pull-down lists. These lists allow the user to select the variables to be displayed in the three-dimensional plot. An example of this applet follows in Figure 2.3 for the states data set.
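Spinning a three-dimensional scatter plot amounts to repeatedly rotating the points by a small angle and redrawing their two-dimensional projection. The sketch below shows one way to do this and is not the ThreeDplot source: the class name Spin3D, the orthographic projection (the depth coordinate is simply dropped), and the assumption that each variable has been rescaled to the interval [-1, 1] are all hypothetical choices.

```java
// Hypothetical rotation and projection helpers; not the authors' code.
final class Spin3D {

    // Rotates the point {x, y, z} about the vertical (y) axis by `angle` radians.
    static double[] rotateY(double[] p, double angle) {
        double c = Math.cos(angle), s = Math.sin(angle);
        return new double[] { c * p[0] + s * p[2], p[1], -s * p[0] + c * p[2] };
    }

    // Rotates the point {x, y, z} about the horizontal (x) axis by `angle` radians.
    static double[] rotateX(double[] p, double angle) {
        double c = Math.cos(angle), s = Math.sin(angle);
        return new double[] { p[0], c * p[1] - s * p[2], s * p[1] + c * p[2] };
    }

    // Orthographic projection onto a square canvas of `size` pixels:
    // the z coordinate is dropped and [-1, 1] is mapped to [0, size-1],
    // with the y direction flipped because pixel rows grow downward.
    static int[] project(double[] p, int size) {
        int px = (int) Math.round((p[0] + 1.0) / 2.0 * (size - 1));
        int py = (int) Math.round((1.0 - (p[1] + 1.0) / 2.0) * (size - 1));
        return new int[] { px, py };
    }
}
```

Under this scheme, holding the mouse over a left or right arrow would apply rotateY with a small positive or negative angle on every animation step, while the up and down arrows would apply rotateX; releasing the mouse simply stops the steps.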

 

Figure 2.3


 

 

2.4 Interactive Parallel Coordinates Plot

The interactive parallel coordinates applet will be called ParallelCplot.class. The layout for this applet is the same as the layout for the interactive pairs plot applet described in Section 2.2. Adding the highlighting capability to the parallel coordinate plot is a bit tricky. This process requires a great deal of up-front calculation before the graphic is displayed and a sophisticated routine for dealing with user interaction. The applet with the states data included is shown in Figure 2.4.
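Part of the difficulty is that each observation is a chain of line segments rather than a single point, so deciding which observation the mouse is over requires a distance from the mouse to each segment. The sketch below shows one standard way to perform such a test. It is an illustration only and is not claimed to be the routine used in ParallelCplot; the class name SegmentPicker, the vertex layout, and the pixel tolerance are assumptions.

```java
// Hypothetical hit test for a polyline; not the authors' routine.
final class SegmentPicker {

    // Shortest distance from point (px, py) to the segment from (ax, ay) to (bx, by).
    static double distanceToSegment(double px, double py,
                                    double ax, double ay, double bx, double by) {
        double dx = bx - ax, dy = by - ay;
        double lengthSquared = dx * dx + dy * dy;
        // Parameter of the closest point on the infinite line, clamped to the segment.
        double t = (lengthSquared == 0.0)
                ? 0.0
                : ((px - ax) * dx + (py - ay) * dy) / lengthSquared;
        t = Math.max(0.0, Math.min(1.0, t));
        double cx = ax + t * dx, cy = ay + t * dy;
        return Math.hypot(px - cx, py - cy);
    }

    // True if the mouse lies within `tolerance` pixels of any segment of the
    // polyline whose vertices are given as vertices[j] = {x, y}.
    static boolean isNear(int[][] vertices, int mouseX, int mouseY, double tolerance) {
        for (int j = 0; j + 1 < vertices.length; j++) {
            double d = distanceToSegment(mouseX, mouseY,
                    vertices[j][0], vertices[j][1],
                    vertices[j + 1][0], vertices[j + 1][1]);
            if (d <= tolerance) {
                return true;
            }
        }
        return false;
    }
}
```

Because the pixel vertices of every observation can be computed once before the plot is displayed, a test of this kind only has to loop over precomputed segments when the mouse moves, which is one plausible reading of the up-front calculation mentioned above.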

 

Figure 2.4


 

 

2.5 A Brush Applet

We have also created an applet which is very similar to the Splus brush function shown in Figure 1.1. In this applet, we have taken each of the components described in the previous sections and included them in one grand applet called DynamicTool.class. The pairs plot is the center of the applet, and either the parallel coordinates plot or the three-dimensional plot is displayed alongside it. The bottom of the applet contains two check boxes, one for the parallel coordinates plot and one for the three-dimensional plot, so that the user may decide which plot to display in the applet. Figure 2.5 shows this applet for the states data.

 

Figure 2.5


 

3. Conclusions

 

As the WWW becomes the avenue by which more and more people access data, it is important that software tools exist which allow individuals to easily understand and analyze data in the context of a WWW page. The applets developed in this paper are meant to serve this purpose for high dimensional data sets. For further examples of these applets in action, see http://www.stat.sc.edu/~west/bradley/Data.html. We encourage any interested readers to include these applets in their WWW pages.

It is important to note that we chose to work with the Java 1.0.2 compiler in order to ensure that most current browsers are compatible with our applets. In the future, as WWW browsers upgrade to newer versions of Java, we hope to provide new, more advanced versions of these applets. In addition, the functionality of these applets will be built into the next version of WebStat so that users can easily explore their own data sets without having to construct a WWW page containing these applets.

 

References

 

Asimov, D. (1985). The Grand Tour: A Tool for Viewing Multidimensional Data. SIAM Journal on Scientific and Statistical Computing, 6(1), 128-143.

Chernoff, H. (1973). The use of Faces to Represent Points in k-Dimensional Space Graphically. Journal of the American Statistical Association, 68, 361-368.

Chomut, T. (1987). Exploratory Data Analysis in Parallel Coordinates. Los Angeles Scientific Center.

Cleveland, W. S. and McGill, M. E. (1988). Dynamic Graphics for Statistics, Wadsworth, Monterey, CA.

Cook, D., Buja, A., and Cabrera, J. (1993). Projection Pursuit Indexes Based on Orthonormal Function Expansions. Journal of Computational and Graphical Statistics, 2(3), 225-250.

Cornell, G. and Horstmann, C. (1996). Core Java, Sun Soft Press.

Fisher, R. A. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(II), 179-188.

Inselberg, A. (1985). The Plane with Parallel Coordinates. The Visual Computer, 1, 69-91.

Jang, D. H., and Yang, S. J. (1996). The Dynamic Parallel Coordinates Plot and its Applications. Journal of the Korean Statistical Society, 9(1), 45-52.

Jaworski, J. (1996). Java Developer’s Guide, 1st Ed., Indianapolis: Sams.net Publishing.

Miller, J. J., and Wegman, E. J. (1991). Construction of Line Densities for Parallel Coordinate Plots. Computing and Graphics in Statistics, 107-123.

Minnotte, M. C., and West, R. W. (1999). The data image: a tool for exploring high dimensional data sets. 1998 Proceedings of the ASA Section on Statistical Graphics, in press.

S-Plus (1988). MathSoft Inc., AT&T.

Statistical Abstract of the United States (1977) and County and City Data Book (1977). U.S. Department of Commerce, Bureau of the Census.

Statistical Abstract of the United States (1998) and State and Metropolitan Area Data Book (1998). U.S. Department of Commerce, Bureau of the Census.

Sun Microsystems, Inc. (1999). The Source For Java Technology, WWW document. http://java.sun.com.

Wegman, E. (1990). Hyperdimensional Data Analysis Using Parallel Coordinates. Journal of the American Statistical Association, 85, 664-675.

West, R. W. (1998). WebStat 1.0, WWW document. http://www.stat.sc.edu/webstat/version1.0.