Archive for March 2009

The Divine R


A colleague recently asked me for a good introductory text on the R statistical computing platform. Though there is a seemingly endless number of published books on R, I recommended a personal favorite, Introductory Statistics with R, Second Edition, by Peter Dalgaard. The book does an excellent job of introducing the R language and demonstrating how R is used to solve real-world statistical problems.

I chuckle when I read uncomplimentary reviews of R documentation by analytics pundits. In addition to scores of books, comprehensive reference manuals, and online help and documentation, there’s a wealth of R “how to” publications written by the community and freely available to anyone with a search engine. One such gem that I recently discovered is The R Inferno by Patrick Burns. The abstract to this brief (103-page) PDF concisely conveys the tome’s goal: “If you are using R and you think you’re in hell, this is a map for you.”

Not only is Burns a capable R analyst, he’s also a very clever writer. The R Inferno is a play on the Inferno cantica of Dante Alighieri’s The Divine Comedy, in which Dante navigates the nine circles of hell. The circles are concentric, each holding sinners more depraved than the last and representing increasingly grievous sins, culminating with Satan at the center of hell.

Burns sees the journey through R learning hell through a similar lens. His concentric circles depict problems that typically trip up those new to R. Much attention is focused on vectorizing computations so that they run efficiently. My experience is proof positive that new R programmers often bring procedural baggage to their learning. Burns also dwells on the many benefits of modular function development in R, as well as its various flavors of object orientation. The eighth circle, Believing It Does as Intended, addresses scores of R gotchas and is pertinent for even the most experienced R programmers. Finally, circle nine clearly articulates the norms the R community has established for asking for help on its many support lists. The uninitiated who routinely leap before they look are not treated charitably in R land.

After reading Inferno, I was prompted to look in the attic for one of my all-time favorite computer books, the now 35-year-old The Elements of Programming Style, by Kernighan and Plauger. (Aging analysts might recognize Brian Kernighan as co-author with Dennis Ritchie of The C Programming Language, one of the most important programming books of the last 30 years.) Just as Burns uses the Divine Comedy as a metaphor for his writing, Kernighan and Plauger take the timeless and concise writing manifesto, The Elements of Style, by Strunk and White, as their model. And just as I try to remember important S&W dictums like “Put statements in positive form”, “Omit needless words”, and “Revise and rewrite” when writing, so too do I look to K&P’s wisdom to structure programming work: “Let the data structure the program”, “Don’t patch bad code; rewrite it”, “Watch out for off-by-one errors”, “Make sure your code ‘does nothing’ gracefully”, and “Make it right before you make it faster”. Much like The Elements of Style and The Elements of Programming Style, The R Inferno is destined to become a manuscript that ages well, one that always rewards those who invest the time to review it.

 

Planning for Predictive Models – Wisdom From Regression Modeling Strategies


I’m getting ready to start another predictive modeling effort and decided to turn to several trusted stats books for a quick review. Three favorites are Maindonald and Braun’s Data Analysis and Graphics Using R, The Elements of Statistical Learning (ESL), by Hastie, Tibshirani and Friedman, and Frank Harrell’s Regression Modeling Strategies (RMS). Together the books provide a nice balance of theory and practice, statistical inference and statistical learning.

 

I didn’t even get past the Preface to RMS before I started taking notes on important considerations for planning my new prediction studies. Indeed, I found the emphases spot on, even though I’m not certain whether I’ll use the regression models that Frank espouses or the statistical learning models of ESL.

 

The following are nuggets of wisdom from RMS for planning/executing modeling studies, along with a statistical blogger’s commentary:

 

1) The cost of data collection outweighs the cost of data analysis. This means it’s critical to maximize the value of data in hand and to analyze it judiciously. It also underscores the oft-heard warning from Predictive Analytics World that data quality is perhaps the leading risk/success factor for predictive analytics projects.

2) Prudent handling of missing data is critical. Simple deletion of cases for which there are missing attributes can lead to prediction coefficients that are either terribly biased or grossly inefficient. There are well-developed methodologies and statistical procedures for “imputing” missing values that should be a part of the analyst’s arsenal.

3) Mean square error, which equals variance plus squared bias, is a common criterion for evaluating a model. Statisticians often look first for unbiased estimates, but in many cases it may be better to accept a small amount of bias in exchange for reduced variance.

4) Analysts need to pay special attention to non-linearity and non-additivity in their models. The careless deployment of simple linear models is often a by-product of the regression capabilities of BI tools. A misspecified model may lead to erroneous predictions and results. Techniques like cubic splines are available for testing and incorporating these complications in standard models.

5) Graphical methods to support the understanding of complex models are critical. The connection of predictive models to graphics is particularly strong in R. The lattice graphics included in R, an implementation of the trellis displays pioneered by William Cleveland, are central to its productivity and popularity.

6) Methods for handling large numbers of predictors are central to today’s predictive models. Fortunately, there are answers like data reduction methods (e.g. principal components) from the multivariate statistics world, as well as Least Angle Regression (LARS), the Lasso, Random Forests, and Gradient Boosting from statistical learning.

7) Overfitting is a common problem. Model validation approaches that include the bootstrap and cross validation are now central to estimating and testing. The stepwise regression procedures I learned in grad school 30 years ago are now non grata in the prediction world. Fortunately, resampling techniques that are part and parcel of statistical practice have come to the rescue.
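
The bias/variance point in item 3 is easy to check numerically. What follows is a minimal simulation sketch of my own, not anything taken from RMS: it assumes normally distributed data and compares the ordinary sample mean with a hypothetical “shrunken” mean (the sample mean multiplied by 0.8). The function name simulate_estimators and every parameter value (true mean 2, standard deviation 3, samples of 10, 20,000 trials) are illustrative assumptions. It is written in Python with only the standard library so that it runs anywhere.

```python
import random
import statistics


def simulate_estimators(true_mean=2.0, sd=3.0, n=10, shrink=0.8,
                        trials=20000, seed=42):
    """Compare the sample mean with a shrunken mean over repeated samples.

    All parameter values are illustrative assumptions, not figures from RMS.
    """
    rng = random.Random(seed)
    estimates = {"sample mean": [], "shrunken mean": []}
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        xbar = statistics.fmean(sample)
        estimates["sample mean"].append(xbar)             # unbiased estimator
        estimates["shrunken mean"].append(shrink * xbar)  # pulled toward zero

    results = {}
    for name, values in estimates.items():
        # Bias: average estimate minus the true value.
        bias = statistics.fmean(values) - true_mean
        # Variance: spread of the estimates around their own average.
        variance = statistics.pvariance(values)
        # Mean square error: average squared distance from the true value.
        mse = statistics.fmean([(v - true_mean) ** 2 for v in values])
        results[name] = {"bias": bias, "variance": variance, "mse": mse}
    return results


if __name__ == "__main__":
    for name, r in simulate_estimators().items():
        print(f"{name:14s} bias={r['bias']:+.3f} "
              f"variance={r['variance']:.3f} "
              f"variance+bias^2={r['variance'] + r['bias'] ** 2:.3f} "
              f"mse={r['mse']:.3f}")
```

Under these assumed settings, theory puts the mean square error of the sample mean at 9/10 = 0.9 (all variance, no bias) and that of the shrunken mean at 0.576 + 0.16 = 0.736, so the simulated figures should land near those values. The shrunken estimator wins here only because the assumed true mean sits fairly close to zero; with a true mean far from zero the squared bias would swamp the variance savings, which is exactly the trade-off the analyst has to weigh.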

 

 

 

Rattle Redux and Predictive Analytics World Potpourri


 

Rattle

 

I received an email from John Maindonald the other day. A little over a year ago, I wrote a review of an excellent statistical text, Data Analysis and Graphics Using R, which John co-authored with John Braun. Part of his message was to let me know that the 3rd edition of Data Analysis would be coming out soon. Maindonald is also on the faculty of the Australian National University, co-teaching a course on data mining with Graham Williams. Williams is the developer of Rattle, the R Analytical Tool To Learn Easily, a front end to the significant machine learning/data mining capabilities of R. The second piece of John’s message was a request to update the URL to the course for Information Management readers. Done. I would highly recommend Math3346 for those seeking an accessible treatment of applied data mining.

 

Predictive Analytics World

 

As I mentioned in last week’s blogs, I was pleasantly surprised by version one of Predictive Analytics World, finding it quite useful on a number of levels. Today, I offer a few final observations on the conference.

 

I guess I shouldn’t be too surprised that the most oft-cited success (or risk) factors for analytics deployments have to do not with analytics per se, but rather with business sponsorship, business/IT/analytics team alignment, methodology, data quality, communication, incremental wins, and governance. It appears lessons learned for predictive analytics look much like those for broader business intelligence.

 

On the evening of Wednesday, Feb 18, the Bay Area useR Group (R Programming Language) held its meeting using PAW hotel facilities. Seventy people, many of whom were not R users, listened to presentations by commercial R vendor Revolution Computing as well as web titans Facebook and Google. Both Facebook and Google are big advocates of R’s open source analytics and graphical capabilities, employing analysts who learned the package in grad school. R is particularly popular for preliminary, exploratory data analysis (EDA) tasks.

 

I was a bit surprised by the limited range of analytics techniques demonstrated in the technical sessions I attended. Logistic regression and CART seemed the norm for classification problems, while ordinary least squares and stepwise regression appeared the choice for interval-level prediction. One session presented a hand-rolled ensemble of logistic regressions, demonstrating reduced variance and sharpened predictions – results R users take for granted with Random Forests and Gradient Boosting. Maybe I’m just spoiled by the embarrassment of riches available to predictive modelers in R. There are now scores of the very latest techniques accessible for free.

 

The Bay Area is home to the top two schools of statistics in the U.S., Stanford and Cal Berkeley. It would have been nice to have an academic perspective on the current state of predictive analytics, especially given the rapid developments in both statistics and machine learning. Any one of the Stanford professors Trevor Hastie, Rob Tibshirani, and Jerome Friedman, co-authors of the just-released book, The Elements of Statistical Learning, Second Edition, would have been an ideal presenter. Perhaps next year there can be sessions surveying both statistical learning and Bayesian modeling.

 

Looking forward to PAW 2010!

 

 

 
