Explorations of the lineup protocol for visual inference: application to high dimension, low sample size problems and metrics to assess the quality

Roy Chowdhury, Niladri
Major Professor
Dianne Cook
Organizational Unit
As leaders in statistical research, collaboration, and education, the Department of Statistics at Iowa State University offers students an education like no other. We are committed to our mission of developing and applying statistical methods, and proud of our award-winning students and faculty.

Statistical graphics play an important role in exploratory data analysis, model checking, and diagnosis. Recent developments suggest that visual inference helps to quantify the significance of findings made from graphics. In visual inference, a lineup embeds the plot of the data among a set of null plots and asks a human observer to select the plot that is most different from the rest. Selecting the data plot corresponds to rejecting the null hypothesis. With high-dimensional data, statistical graphics are obtained by plotting low-dimensional projections; in classification tasks, for example, projection pursuit is used to find low-dimensional projections that reveal differences between labelled groups. In many contemporary data sets the number of observations is small relative to the number of variables, which is known as a high dimension, low sample size (HDLSS) problem. The research described in this thesis explores the use of visual inference for understanding low-dimensional pictures of HDLSS data. This approach may help to broaden the understanding of issues related to HDLSS data in the data analysis community. Methods are illustrated using data from a published paper that erroneously found real separation in microarray data. The thesis also describes metrics developed to assist the use of lineups for making inferential statements. The metrics measure the quality of a lineup and help to understand what people see in the data plots. The null plots represent a finite sample from a null distribution, and the particular sample selected can make a lineup easier or harder to read. Distance metrics are designed to describe how close the true data plot is to the null plots, and how close the null plots are to each other. The distribution of the distance metrics is studied to learn how well it matches what people detect in the plots, and to assess the effect of the null-generating mechanism and of plot choices for particular tasks.
The analysis used data collected from Amazon Mechanical Turk studies in which lineups were applied to an array of exploratory data analysis tasks. Finally, an R package is constructed to provide open source tools for visual inference and distance metrics.
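As a rough illustration of the ideas in the abstract, the sketch below builds a lineup, measures how far apart two panels are, and turns observer picks into a p-value. It is not taken from the thesis or its R package. The function names (`make_lineup`, `binned_distance`, `visual_p_value`), the lineup size of 20, the permutation null, and the 5 x 5 binning are all assumptions chosen for the example. The p-value treats observers as independent, each picking the data plot with probability 1/m under the null, which is a simplification.

```python
import math
import random


def make_lineup(x, y, m=20, seed=0):
    """Hide the real (x, y) data among m - 1 null panels.

    Null panels permute y, which breaks any x-y association
    (an assumed null mechanism, chosen for illustration).
    Returns (panels, data_pos).
    """
    rng = random.Random(seed)
    data_pos = rng.randrange(m)
    panels = []
    for i in range(m):
        y_panel = list(y)
        if i != data_pos:
            rng.shuffle(y_panel)
        panels.append((list(x), y_panel))
    return panels, data_pos


def _bin_index(value, lo, hi, bins):
    # Map a value to a bin in [0, bins - 1]; a zero-width range goes to bin 0.
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * bins), bins - 1)


def _binned_counts(panel, bins, x_lo, x_hi, y_lo, y_hi):
    xs, ys = panel
    counts = [[0] * bins for _ in range(bins)]
    for xv, yv in zip(xs, ys):
        counts[_bin_index(xv, x_lo, x_hi, bins)][_bin_index(yv, y_lo, y_hi, bins)] += 1
    return counts


def binned_distance(panel_a, panel_b, bins=5):
    """Euclidean distance between the 2-D bin counts of two scatterplots.

    A hypothetical stand-in for a plot distance metric, not the thesis's own.
    """
    all_x = panel_a[0] + panel_b[0]
    all_y = panel_a[1] + panel_b[1]
    x_lo, x_hi, y_lo, y_hi = min(all_x), max(all_x), min(all_y), max(all_y)
    ca = _binned_counts(panel_a, bins, x_lo, x_hi, y_lo, y_hi)
    cb = _binned_counts(panel_b, bins, x_lo, x_hi, y_lo, y_hi)
    return math.sqrt(sum((ca[i][j] - cb[i][j]) ** 2
                         for i in range(bins) for j in range(bins)))


def visual_p_value(n_observers, n_correct, m=20):
    """P(X >= n_correct) for X ~ Binomial(n_observers, 1/m).

    Assumes independent observers who each pick the data plot
    with probability 1/m when the null hypothesis holds.
    """
    p = 1.0 / m
    return sum(math.comb(n_observers, j) * p ** j * (1 - p) ** (n_observers - j)
               for j in range(n_correct, n_observers + 1))
```

Comparing the distance from the data panel to each null panel with the distances among the null panels gives a rough sense of how much the data plot stands out, which is the role the abstract assigns to the distance metrics.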

Wed Jan 01 00:00:00 UTC 2014