Finding Large Average Submatrices in High Dimensional Data

Andrey A. Shabalin1, Victor J. Weigman2, Charles M. Perou3,4,5, and Andrew B. Nobel1,3

Published in The Annals of Applied Statistics, Volume 3, Number 3 (2009), 985-1012.
Also available at arxiv.org.

1 Department of Statistics and Operations Research, University of North Carolina at Chapel Hill
2 Department of Biology, University of North Carolina at Chapel Hill
3 Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill
4 Department of Genetics, University of North Carolina at Chapel Hill
5 Department of Pathology, University of North Carolina at Chapel Hill

Abstract:

The search for sample-variable associations is an important problem in the exploratory analysis of high dimensional data. Biclustering methods search for sample-variable associations in the form of distinguished submatrices of the data matrix. (The rows and columns of a submatrix need not be contiguous.) In this paper we propose and evaluate a statistically motivated biclustering procedure (LAS) that finds large average submatrices within a given real-valued data matrix. The procedure operates in an iterative-residual fashion, and is driven by a Bonferroni-based significance score that effectively trades off between submatrix size and average value. We examine the performance and potential utility of LAS, and compare it with a number of existing methods, through an extensive three-part validation study using two gene expression datasets. The validation study examines quantitative properties of biclusters, biological and clinical assessments using auxiliary information, and classification of disease subtypes using bicluster membership. In addition, we carry out a simulation study to assess the effectiveness and noise sensitivity of the LAS search procedure. In closing, we propose that LAS is a simple and effective exploratory tool for discovery of biologically relevant structures in high dimensional data.
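
As a rough illustration of how a Bonferroni-based score can trade off submatrix size against average value, the sketch below scores a k-by-l submatrix with average `avg` inside an m-by-n matrix. It takes the negative log of the Gaussian tail probability of the average, multiplied by the number of k-by-l submatrices. It assumes the entries behave like independent standard normals under the null. The function names (`log_binom`, `log_upper_tail`, `las_score`) are hypothetical and chosen for this example; this is not the code of the LAS program.

```python
import math


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_upper_tail(x):
    """Natural log of P(Z > x) for a standard normal Z."""
    tail = 0.5 * math.erfc(x / math.sqrt(2.0))
    if tail > 0.0:
        return math.log(tail)
    # erfc underflowed to zero (very large x); use the leading-order
    # asymptotic approximation of the normal tail instead.
    return -0.5 * x * x - math.log(x) - 0.5 * math.log(2.0 * math.pi)


def las_score(avg, k, l, m, n):
    """Illustrative Bonferroni-style score (assumed form, see lead-in).

    avg  : average of the entries in the candidate submatrix
    k, l : number of rows and columns in the submatrix
    m, n : number of rows and columns in the full data matrix
    """
    # Under the standard normal null, avg * sqrt(k * l) is itself standard normal.
    log_p = log_upper_tail(avg * math.sqrt(k * l))
    # Bonferroni correction: multiply the tail probability by the number
    # of k-by-l submatrices, C(m, k) * C(n, l), working on the log scale.
    return -(log_binom(m, k) + log_binom(n, l) + log_p)


if __name__ == "__main__":
    # Same size, larger average: the score goes up.
    print(las_score(1.0, 10, 10, 100, 100), las_score(2.0, 10, 10, 100, 100))
    # Same average, larger submatrix: the score also goes up here.
    print(las_score(1.0, 10, 10, 100, 100), las_score(1.0, 20, 20, 100, 100))
```

Working on the log scale keeps the score finite even when the tail probability is far too small to represent directly as a floating-point number.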

Download:

LAS program with user interface.
May require the .NET Framework (available through Windows Update or as a direct download).
Can be run on Mac OS X and Unix/Linux using Mono.

LAS Guide and Manual.
Describes the input and output file formats.

Command line version of LAS.
Can be used for batch processing. Included in the download above as LAScon.exe. For help run LAScon.exe with no parameters.

Matlab code for LAS, available with or without the breast cancer data.

Supplementary materials (mirror).

The breast cancer data used in the manuscript.

 

Why use LAS/Biclustering?

LAS biclustering can reveal structure in your data that is not visible after row/column clustering. For example, consider the heatmap of a microRNA dataset below and, to its right, one bicluster (submatrix) with a large negative average. Even though the rows and columns of the dataset are hierarchically clustered, the bicluster would be hard to spot with the naked eye.