Archive for January 2009

R and the New York Times


I just had an article posted on the new Information Management (DM Review) media site on statistical learning in R. For those unfamiliar, the R Project for Statistical Computing is fast becoming the statistical and analytics platform of choice for a world-wide cadre of academic and open source-inclined statisticians. I love R and have been gratified to cover its rapid progress in the BI media for OpenBI.

On January 7, The New York Times published an article, “Data Analysts Captivated by R’s Power,” that coincided conveniently with a webinar co-hosted by Jaspersoft and OpenBI on extending Jaspersoft’s BI capabilities through platform integration with R.

Overall, I thought the article was pretty good, but I noted a few annoyances in my Information Management article. First, I don’t know of anyone who’d describe R as “a supercharged version of Microsoft’s Excel spreadsheet software”. Most R users view Microsoft as the evil empire and acknowledge it as little as possible. Second, and more critically, the article didn’t pay proper homage to the work of John Chambers and colleagues at Bell Labs, who developed R’s predecessor S in the 1980s and 1990s. To be sure, Ross Ihaka and Robert Gentleman (the R guys) worked heroically to engender R, but R is essentially a rewrite of S – so without S there’d be no R. Indeed, Insightful, purveyor of S+, the commercial version of S, was recently acquired by TIBCO in a market concession move. S+ is an outstanding product, but it lost the battle to its open source kin.

My overall take on the article, though, was positive: I was happy to see R get the long-overdue attention from a mainstream publication like the Times. Follow-up postings and blogs seemed to clarify the ambiguities. Several participants in R’s passionate support forums couldn’t seem to let go, however, claiming all types of ulterior motives for both the article and platform authors.

Commercial software vendors regularly attack an open source entry in their market by promoting FUD – fear, uncertainty, and doubt. An article quote on R by a SAS marketing executive seemed to hit a FUD nerve with the open source community:

“I think it addresses a niche market for high-end data analysts that want free, readily available code,” said Anne H. Milley, director of technology product marketing at SAS. She adds, “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”

To which Vanderbilt Professor Frank Harrell, an esteemed member of the R community, responded:

“It’s interesting that SAS Institute feels that non-peer-reviewed software with hidden implementations of analytic methods that cannot be reproduced by others should be trusted when building aircraft engines.”

 

Touché.

Musings on Bayes and BI


I recently posted Part 1 of a series on Bayes and Business Intelligence for the B-eye-Network: http://www.b-eye-network.com/view/9391, and am now busy writing Part 2. The more I get into Bayesian thinking, the more I realize esteemed Stanford statistician Brad Efron is correct. To be a Bayesian, an analyst must always think like one.

In its simplest form, Bayes’ Law can be explained as follows: If E is an event or hypothesis of interest and D is some data or evidence, we are concerned about P(E|D), the probability of hypothesis E given, or conditioned on, evidence D. P(E|D) can be calculated as P(E)*P(D|E)/(P(D|E)*P(E) + P(D|~E)*P(~E)), where ~E means not event E. P(E|D) is often called the posterior probability, while P(E) is known as the prior probability, P(D|E) is called the likelihood function, and the ugly right-side denominator, which is just P(D), is a normalizing factor. So we have the posterior probability = prior probability*likelihood function/normalizing factor. What makes this mumbo-jumbo pertinent is that it provides a powerful way of helping BI realize its charter of facilitating sequential and adaptive organizational learning. We can assess the posterior probability of an important business outcome given a shift in company strategy or operations by establishing the known prior probabilities and wrestling through a likelihood function. The calculated posterior probabilities from step one then become the priors for step two, and the posteriors ~= priors*likelihood cycle repeats, promoting adaptive learning.
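For readers who prefer to see the cycle in code, here is a minimal Python sketch of the posterior-becomes-prior loop described above. The function names (posterior, sequential_update) and all of the probabilities are made up for illustration; they are assumptions, not figures from any real BI scenario.

```python
def posterior(prior, p_d_given_e, p_d_given_not_e):
    """Return P(E|D) from P(E), P(D|E), and P(D|~E) via Bayes' Law."""
    numerator = prior * p_d_given_e
    # Normalizing factor: P(D|E)*P(E) + P(D|~E)*P(~E)
    normalizer = numerator + p_d_given_not_e * (1.0 - prior)
    return numerator / normalizer


def sequential_update(prior, likelihoods):
    """Apply Bayes' Law once per piece of evidence.

    Each posterior becomes the prior for the next step. `likelihoods`
    is a list of (P(D|E), P(D|~E)) pairs, one per observation.
    """
    history = []
    for p_d_given_e, p_d_given_not_e in likelihoods:
        prior = posterior(prior, p_d_given_e, p_d_given_not_e)
        history.append(prior)
    return history


# Hypothetical numbers: a 30% prior that a strategy change is working,
# followed by two rounds of evidence.
steps = sequential_update(0.30, [(0.8, 0.4), (0.7, 0.5)])
print([round(p, 4) for p in steps])
```

Each round of evidence that is more likely under E than under ~E nudges the probability upward, and the second update starts from wherever the first one ended, which is the adaptive learning loop in miniature.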

I saw a first-hand illustration of this Bayesian thinking over the weekend. I spent both Saturday and Sunday watching the initial league matches for my daughter’s 15-year-old volleyball team. 160 Midwest teams started competition at 3 locales in the Chicagoland area. The teams were seeded prior to play based on last year’s performance, coaches’ evaluations, random assignment of new clubs, etc. They then went through 2 rounds of pool play to further determine rankings for league competition that starts the week after next. Won/loss record and score differential determined the movement, if any, from initial ratings. Based on the preliminary rankings and the results of the first weekend of play, the teams are divided into 16 progressive brackets of 10 each for inter-bracket competition that will ultimately yield seedings for nationals to be held in June. Teams are able to move up and down into other brackets between January and June based on performance, their likelihood function. At the end of the season, the initial rankings (the priors) and league/tournament performance (the likelihood) determine the ultimate team rankings after finals – the posteriors. Of course, the whole process starts over in 2010 for 16-year-olds, with the 2009 posterior rankings becoming next year’s priors.

 
