<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>R-statistics blog &#187; Tal Galili</title>
	<atom:link href="http://www.r-statistics.com/author/admin/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.r-statistics.com</link>
	<description>Writing about statistics with R, and open source stuff (software, data, community)</description>
	<lastBuildDate>Thu, 29 Jul 2010 01:51:23 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Blogging about R &#8211; presentation and audio</title>
		<link>http://www.r-statistics.com/2010/07/blogging-about-r-presentation-and-audio/</link>
		<comments>http://www.r-statistics.com/2010/07/blogging-about-r-presentation-and-audio/#comments</comments>
		<pubDate>Thu, 29 Jul 2010 01:48:29 +0000</pubDate>
		<dc:creator>Tal Galili</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[R community]]></category>
		<category><![CDATA[wordpress]]></category>
		<category><![CDATA[blog]]></category>
		<category><![CDATA[blogging]]></category>
		<category><![CDATA[useR]]></category>
		<category><![CDATA[useR conference]]></category>
		<category><![CDATA[useR2010]]></category>

		<guid isPermaLink="false">http://www.r-statistics.com/?p=499</guid>
		<description><![CDATA[At the useR!2010 conference I had the honor of giving a (~15 minute) talk titled &#8220;Blogging about R&#8221;. The following is the abstract I submitted, followed by the slides of the talk and the audio file of a recording I made of the talk (I am sad it got a bit of &#8220;hall echo&#8221;, but it&#8217;s still listenable&#8230;) P.S: this post does not absolve me from writing up something (with many thanks and links to people) about the useR2010 conference, [...]]]></description>
			<content:encoded><![CDATA[<p>At the <a href="http://user2010.org/">useR!2010</a> conference I had the honor of giving a (~15 minute) talk titled &#8220;Blogging about R&#8221;.  The following is the abstract I submitted, followed by the slides of the talk and the audio file of a recording I made of the talk (I am sad it got a bit of &#8220;hall echo&#8221;, but it&#8217;s still listenable&#8230;)</p>
<p><em>P.S: this post <strong>does not</strong> absolve me from writing up something (with many thanks and links to people) about the useR2010 conference, but I can see it taking a bit longer till I do that.<br />
</em><br />
&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;</p>
<h3>Abstract of the talk</h3>
<p>This talk is a basic introduction to blogs: why to blog, how to blog, and the importance of the R blogosphere to the R community.</p>
<p>Because R is an open-source project, the R community members rely (mostly) on each other&#8217;s help for statistical guidance, generating useful code, and general moral support.</p>
<p>Current online tools available for us to help each other include the R mailing lists, the community R-wiki, and the R blogosphere.  The emerging R blogosphere is the only source, besides the R journal, that provides our community with articles about R.  While these articles are not peer reviewed, they do come in higher volume (and often are of very high quality).</p>
<p>According to the meta-blog <a href="http://www.r-bloggers.com/">R-bloggers.com</a>, the (English) R blogosphere produced about 115 &#8220;articles&#8221; about R in January 2010. There were (at the time) a bit over 50 bloggers (now about 100) writing about R, with about 1000 (now ~2200) subscribers reading them daily (through e-mail or RSS). These numbers lead me to believe that there is genuine interest in our community for more people &#8211; perhaps you? &#8211; to start (and continue) blogging about R.</p>
<p>In this talk I intend to share knowledge about blogging so that more people are able to participate (freely) in the R blogosphere &#8211; both as readers and as writers.  The talk will have three main parts:</p>
<ul>
<li>What is a blog
</li>
<li>How to blog – using the (free) blogging service WordPress.com (with specific emphasis on R)</li>
<li>
How to develop readership &#8211; integration with other social media/networks platforms, SEO, and other best practices</li>
</ul>
<p>*  *  *<br />
Tal Galili founded www.R-bloggers.com and blogs on www.R-statistics.com<br />
*  *  *</p>
<h3>Audio recording of the talk</h3>
<p><span id="more-499"></span><br />
<a href="http://www.r-statistics.com/wp-content/uploads/2010/07/Tal Galili - Blogging about R - useR2010.ogg">Click here to download the audio file</a></p>
<p><embed src="http://www.r-statistics.com/wp-content/uploads/2010/07/Tal Galili - Blogging about R - useR2010.ogg"></p>
<h3>Slides</h3>
<p class="gde-text"><a href="http://www.r-statistics.com/wp-content/uploads/2010/07/Blogging%20about%20R.pdf" target="_blank" class="gde-link">Download (PDF, 5.09MB)</a></p>
<iframe src="http://www.r-statistics.com/wp-content/plugins/google-document-embedder/proxy.php?url=http%3A%2F%2Fwww.r-statistics.com%2Fwp-content%2Fuploads%2F2010%2F07%2FBlogging%2520about%2520R.pdf&hl=cs&gdet=&embedded=true" width="500" height="370" frameborder="0" style="min-width:305px;" class="gde-frame"></iframe>


]]></content:encoded>
			<wfw:commentRss>http://www.r-statistics.com/2010/07/blogging-about-r-presentation-and-audio/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Richard Stallman talk+Q&amp;A at the useR! 2010 conference (audio files attached)</title>
		<link>http://www.r-statistics.com/2010/07/richard-stallman-talkqa-at-the-user-2010-conference-audio-files-attached/</link>
		<comments>http://www.r-statistics.com/2010/07/richard-stallman-talkqa-at-the-user-2010-conference-audio-files-attached/#comments</comments>
		<pubDate>Mon, 26 Jul 2010 19:39:15 +0000</pubDate>
		<dc:creator>Tal Galili</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[R community]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[copyleft]]></category>
		<category><![CDATA[free software]]></category>
		<category><![CDATA[GPL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[software as service]]></category>
		<category><![CDATA[useR]]></category>
		<category><![CDATA[useR 2010]]></category>
		<category><![CDATA[useR2010]]></category>

		<guid isPermaLink="false">http://www.r-statistics.com/?p=483</guid>
		<description><![CDATA[The current hosting provider of the files couldn&#8217;t handle the work load. I am now moving the file to a different (hopefully more robust) hosting solution. Please come back in an hour or so to download the files. The files are online again! (The audio files of the full talk by Richard Stallman are attached to the end of this post.) &#8212;&#8212;&#8212;&#8212;&#8212;&#8211; Last week I had the honor of attending the talk given by Richard Stallman, the last keynote speaker [...]]]></description>
			<content:encoded><![CDATA[<p><del datetime="2010-07-27T10:32:41+00:00">The current hosting provider of the files couldn&#8217;t handle the work load.<br />
I am now moving the file to a different (hopefully more robust) hosting solution.<br />
Please come back in an hour or so to download the files.</del><br />
The files are online again!<br />
(<strong>The audio files of the full talk by Richard Stallman are attached to <u><a href="http://www.r-statistics.com/2010/07/richard-stallman-talkqa-at-the-user-2010-conference-audio-files-attached/#more-483">the end of this post.</a></u></strong>)</p>
<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;</p>
<p>Last week I had the honor of attending the talk given by <a href="http://en.wikipedia.org/wiki/Richard_Stallman">Richard Stallman</a>, the last keynote speaker at the <a href="http://user2010.org/">useR 2010</a> conference.  In this post I will give some brief context for the talk, and then provide the audio files of the talk, with a short description of what was said.</p>
<h3>Context for the talk</h3>
<p><span style="text-decoration: underline;"><strong>Richard Stallman </strong></span>can be viewed as one of the fathers of free software (free as in speech, not as in beer).</p>
<p>He is the man who led the <a href="http://www.gnu.org/">GNU project</a> to create a free (as in speech, not as in beer) operating system, on the basis of which GNU-Linux, with its numerous distributions, was created.<br />
Richard also developed a number of pieces of widely used software, including the original Emacs, the GNU Compiler Collection, the GNU Debugger, and many tools in the GNU Coreutils.</p>
<p>Richard also initiated the free software movement, and in October 1985 he founded its formal organization, the Free Software Foundation. He co-founded the League for Programming Freedom in 1989.</p>
<p>Stallman pioneered the concept of &#8220;copyleft&#8221; and he is the main author of several copyleft licenses including the GNU General Public License, the most widely used free software license.</p>
<p>You can read more about him in the Wikipedia article titled &#8220;<a href="http://en.wikipedia.org/wiki/Richard_Stallman">Richard Stallman</a>&#8221;.</p>
<p><span style="text-decoration: underline;"><strong>The useR 2010 conference</strong><strong> </strong></span>is this year&#8217;s edition of the annual four-day conference of the community of people using R.  <a href="http://www.r-project.org/">R</a> is free, open source software for data analysis and statistical computing (here is a bit more about <a href="http://www.r-statistics.com/2009/03/what-is-r/">what R is</a>).</p>
<p>The conference this year was truly a wonderful experience for me.  I  had the pleasure of giving two talks (about which I will blog later this month), listened to numerous talks on the use of R, and had a chance to meet many (<strong>many</strong>) kind and interesting people.</p>
<h3>Richard Stallman&#8217;s talk</h3>
<p>The talk took place on July 23rd 2010 at NIST in the U.S. and was the concluding talk of the useR2010 conference.  The talk consisted of a two-hour lecture followed by a half-hour question and answer session.</p>
<p>On a personal note, I was very impressed by Richard&#8217;s talk.  Richard is not a shy computer geek, but rather a serious leader and thinker trying to stir people to action.  His speech was a sermon on free software, the history of GNU-Linux, the various versions of the GPL, and his own history involving them.</p>
<p>I believe this talk would be of interest to anyone who cares about social solidarity, free software, programming and the hope of a better world for all of us.</p>
<p>I am eager for your thoughts in the comments (but please keep a kind tone).</p>
<p><strong><span style="text-decoration: underline;">Here is Richard Stallman&#8217;s (2-hour) talk:</span></strong></p>
<p><span id="more-483"></span><br />
<a href="http://www.r-statistics.com/wp-content/uploads/podcasts/Richard%20Stallman%20speach%20at%20useR2010%20-%20Talk.ogg"><strong>Audio file to download &#8211; Richard Stallman talk at the useR! 2010 conference</strong> (~2 hours)</a><br />
<audio src="http://www.r-statistics.com/wp-content/uploads/podcasts/Richard%20Stallman%20speach%20at%20useR2010%20-%20Talk.ogg"></audio></p>
<p><strong><span style="text-decoration: underline;">The second part of the talk</span></strong> consisted of Richard Stallman answering the following questions:</p>
<ul>
<li>What are your thoughts about <strong>data portability</strong>?</li>
<li>What are your thoughts about <strong>Facebook</strong>?</li>
<li>Isn&#8217;t it a problem that free software doesn&#8217;t create <strong>wealth</strong>?</li>
<li>What are your thoughts about <strong>innovation</strong>?</li>
<li>What are your thoughts about software as a service (a.k.a. <strong>cloud computing</strong>)?</li>
<li>How can we defend open source software from &#8220;<strong>hackers</strong>&#8221;?</li>
<li>What are your thoughts about <strong>Google</strong>&#8217;s products and services?</li>
<li>What are your thoughts about the legality/ethics of people changing from<strong> GPL to closed source</strong>?</li>
<li>How can a programmer be &#8220;<strong>compensated</strong>&#8221; for his contribution to free &#8220;open source&#8221; software?</li>
<li>What are your thoughts about &#8220;free <strong>games</strong>&#8221;?</li>
<li>What are your thoughts about <strong>search</strong> results?</li>
<li>What are your thoughts about taxes and <strong>government </strong>responsibility for the use of free software?</li>
</ul>
<p><a href="http://www.r-statistics.com/wp-content/uploads/podcasts/Richard%20Stallman%20speach%20at%20useR2010%20-%20QA.ogg"><strong>Audio file to download &#8211; Richard Stallman talk at the useR! 2010 conference &#8211; Q&#038;A session</strong> (~25 minutes)</a></p>
<p><audio src="http://www.r-statistics.com/wp-content/uploads/podcasts/Richard%20Stallman%20speach%20at%20useR2010%20-%20QA.ogg"></audio></p>
<p>A final note: more talks from the useR2010 conference are expected to be put online <a href="http://www.vcasmo.com/user/drewconway">here</a>, thanks to <a href="http://www.drewconway.com/zia/?p=2221">Drew Conway</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.r-statistics.com/2010/07/richard-stallman-talkqa-at-the-user-2010-conference-audio-files-attached/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
<enclosure url="http://dl.dropbox.com/u/5371432/WebSites/R-statistics.com/audio/Richard%20Stallman%20speach%20at%20useR2010%20-%20Talk.mp3" length="118416194" type="audio/mpeg" />
<enclosure url="http://dl.dropbox.com/u/5371432/WebSites/R-statistics.com/audio/Richard%20Stallman%20speach%20at%20useR2010%20-%20QA.mp3" length="24545906" type="audio/mpeg" />
		</item>
		<item>
		<title>Want to join the closed BETA of a new Statistical Analysis Q&amp;A site &#8211; NOW is the time!</title>
		<link>http://www.r-statistics.com/2010/07/want-to-join-the-closed-beta-of-a-new-statistical-analysis-qa-site-now-is-the-time/</link>
		<comments>http://www.r-statistics.com/2010/07/want-to-join-the-closed-beta-of-a-new-statistical-analysis-qa-site-now-is-the-time/#comments</comments>
		<pubDate>Fri, 16 Jul 2010 07:06:56 +0000</pubDate>
		<dc:creator>Tal Galili</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[R community]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[communities]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[online]]></category>
		<category><![CDATA[Q&A]]></category>
		<category><![CDATA[statistical analysis]]></category>

		<guid isPermaLink="false">http://www.r-statistics.com/?p=474</guid>
		<description><![CDATA[The bottom line of this post is for you to go to: Stack Exchange Q&#038;A site proposal: Statistical Analysis And commit yourself to using the website for asking and answering questions. (And also consider giving the contender, MetaOptimize a visit) * * * * Statistical analysis Q&#038;A website is about to go into BETA A month ago I invited readers of this blog to commit to using a new Q&#038;A website for Data-Analysis (based on StackOverFlow engine), once it will [...]]]></description>
			<content:encoded><![CDATA[<p><strong>The bottom line of this post is for you to go to:<br />
<a href="http://area51.stackexchange.com/proposals/33/statistical-analysis?referrer=3OUOcMUJcOo1">Stack Exchange Q&#038;A site proposal: Statistical Analysis </a><br />
And commit yourself to using the website for asking and answering questions.</strong></p>
<p>(And also consider giving the contender, <a href="http://metaoptimize.com/qa">MetaOptimize</a> a visit)</p>
<p>* * * * </p>
<h3>Statistical analysis Q&#038;A website is about to go into BETA</h3>
<p>A month ago I <a href="http://www.r-statistics.com/2010/06/a-new-qa-website-for-data-analysis-based-on-stackoverflow-engine-is-waiting-for-you/">invited readers of this blog to commit to using a new Q&#038;A website for Data-Analysis</a> (based on the StackOverFlow engine), once it opens (the site was originally proposed by <a href="http://robjhyndman.com/researchtips/">Rob Hyndman</a>).<br />
And now, a month later, I am happy to write that <strong>over 500 people</strong> have shown interest in the website, and chose to commit themselves.  This means we have reached 100% completion of the website proposal process, and in the next few days we will move to the next step.</p>
<p>The next step is that the website will go into closed BETA for about a week.  If you want to be part of this &#8211; now is <a href="http://area51.stackexchange.com/proposals/33/statistical-analysis?referrer=3OUOcMUJcOo1">the time to join</a> (&lt;--- call to action, people).<br />
Having taken part in other closed BETAs of similar projects, I can attest that the enthusiasm of the people trying to answer questions in the BETA is very impressive, so I strongly recommend the experience.</p>
<p>If you don't make it in time, then no worries - about a week or so after the website goes online, it will be open to the general public.</p>
<p>(p.s: thanks Romunov for pointing out to me that the BETA is about to open)</p>
<h3>p.s: MetaOptimize</h3>
<p>I would like to finish this post by mentioning <a href="http://metaoptimize.com/qa/">MetaOptimize</a>.   This is a Q&#038;A website that serves more of a &#8220;machine learning&#8221; than a &#8220;statistical&#8221; community.  It also started a short while ago, and it already has <a href="http://metaoptimize.com/qa/users/">around 700 users</a> who have submitted ~160 questions with ~520 answers given.  From my experience on the site so far, I have enjoyed the high quality of the questions and answers.<br />
When I first came by the website, I feared that supporting it would split the R community of users between this website and the <a href="http://area51.stackexchange.com/proposals/33/statistical-analysis?referrer=3OUOcMUJcOo1">area 51 StackExchange website</a>.<br />
But after a lengthy discussion (<a href="http://www.r-statistics.com/2010/07/statistical-analysis-qa-website-did-stackoverflow-just-lose-it-to-metaoptimize-and-is-it-good-or-bad/">published recently as a post</a>) with MetaOptimize&#8217;s founder, Joseph Turian, I came to have a more optimistic view of the competition between the two websites.  Where at first I was afraid, I am now <strong>hopeful</strong> that each of the two websites will manage to draw slightly different communities of people (who otherwise wouldn&#8217;t be present on the other website) &#8211; thus offering all of us a wider variety of knowledge to tap into.</p>
<p>See you there&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.r-statistics.com/2010/07/want-to-join-the-closed-beta-of-a-new-statistical-analysis-qa-site-now-is-the-time/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>New versions for ggplot2 (0.8.8) and plyr (1.0) were released today</title>
		<link>http://www.r-statistics.com/2010/07/released-today-new-versions-for-ggplot2-0-8-8-and-plyr-1-0/</link>
		<comments>http://www.r-statistics.com/2010/07/released-today-new-versions-for-ggplot2-0-8-8-and-plyr-1-0/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 07:32:11 +0000</pubDate>
		<dc:creator>Tal Galili</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[Hadley Wickham]]></category>
		<category><![CDATA[news]]></category>
		<category><![CDATA[plyr]]></category>
		<category><![CDATA[update]]></category>

		<guid isPermaLink="false">http://www.r-statistics.com/?p=459</guid>
		<description><![CDATA[Among the many packages on the CRAN website, several succeed in standing out for their widespread use (and quality); Hadley Wickham&#8217;s ggplot2 and plyr are two such packages. And today (through twitter) Hadley has updated the rest of us with the news: just released new versions of plyr and ggplot2. source versions available on cran, compiled will follow soon #rstats Going to the CRAN website shows that plyr has gone through the [...]]]></description>
			<content:encoded><![CDATA[<p>Among the many packages on the CRAN website, several succeed in standing out for their widespread use (and quality); <a href="http://had.co.nz/">Hadley Wickham&#8217;s </a><a href="http://had.co.nz/ggplot2/">ggplot2 </a>and <a href="http://had.co.nz/plyr/">plyr </a>are two such packages.<br />
<img src="http://had.co.nz/plyr/pliers.jpg" alt="plyr image" /><br />
And today (<a href="http://twitter.com/hadleywickham/status/17814050267">through twitter</a>) Hadley has updated the rest of us with the news:</p>
<blockquote><p>just released new versions of plyr and ggplot2. source versions available on cran, compiled will follow soon #rstats</p></blockquote>
<p>Going to the CRAN website shows that plyr has gone through the more major update, with its previous release dating back to 2009-06-23.  And now, over a year later, we are presented with plyr version 1, which includes new functions, new features, some bug fixes, and much-anticipated speed improvements.<br />
ggplot2 has made a tiny leap from version 0.8.7 to 0.8.8, and was previously last updated on 2010-03-03.</p>
<p>I am, as I am sure many R users are, very thankful for the amazing work that Hadley Wickham is doing (both on his code, and in helping other useRs on the help lists).  So Hadley, <strong>thank you</strong>!</p>
<p>Here is the complete change-log list for both packages:<br />
<span id="more-459"></span></p>
<h3>plyr 1.0 (2010-07-02)</h3>
<p>(taken from <a href="http://cran.r-project.org/web/packages/plyr/NEWS">the CRAN website</a>)<br />
<strong> New functions:</strong></p>
<p>* arrange, a new helper method for reordering a data frame.<br />
* count, a version of table that returns data frames immediately and that is<br />
much much faster for high-dimensional data.<br />
* desc makes it easy to sort any vector in descending order<br />
* join, works like merge but can be much faster and has a somewhat simpler<br />
syntax drawing from SQL terminology<br />
* rbind.fill.matrix is like rbind.fill but works for matrices, code<br />
contributed by C. Beleites</p>
<p><strong>Speed improvements</strong></p>
<p>* experimental immutable data frame (idata.frame) that vastly speeds up<br />
subsetting &#8211; for large datasets with large numbers of groups, this can yield<br />
10-fold speed ups. See examples in ?idata.frame to see how to use it.<br />
* rbind.fill rewritten again to increase speed and work with more data types<br />
* d*ply now much faster with nested groups</p>
<p><strong>New features:</strong></p>
<p>* d*ply now accepts NULL for splitting variables, indicating that the data<br />
should not be split<br />
* plyr no longer exports internal functions, many of which were causing<br />
clashes with other packages<br />
* rbind.fill now works with data frame columns that are lists or matrices<br />
* test suite ensures that plyr behaviour is correct and will remain correct<br />
as I make future improvements.</p>
<p><strong>Bug fixes:</strong></p>
<p>* **ply: if zero splits, empty list(), data.frame() or logical() returned,<br />
as appropriate for the output type<br />
* **ply: leaving .fun as NULL now always returns list<br />
(thanks to Stavros Macrakis for the bug report)<br />
* a*ply: labels now respect options(stringAsFactors)<br />
* each: scoping bug fixed, thanks to Yasuhisa Yoshida for the bug report<br />
* list_to_dataframe is more consistent when processing a single data frame<br />
* NAs preserved in more places<br />
* progress bars: guaranteed to terminate even if **ply prematurely terminates<br />
* progress bars: misspelling gives informative warning, instead of<br />
uninformative error<br />
* splitter_d: fixed ordering bug when .drop = FALSE</p>
<h3>ggplot2 0.8.8 (2010-07-02)</h3>
<p>(taken from <a href="http://cran.r-project.org/web/packages/ggplot2/NEWS">the CRAN website</a>)</p>
<p><strong>Bug fixes:</strong></p>
<p>* coord_equal finally works as expected (thanks to continued prompting from Jean-Olivier Irisson)<br />
* coord_equal renamed to coord_fixed to better represent capabilities<br />
* coord_polar and coord_polar: new munching system that uses distances (as defined by the coordinate system) to figure out how many pieces each segment should be broken in to (thanks to prompting from Jean-Olivier Irisson)<br />
* fix ordering bug in facet_wrap (thanks to bug report by Frank Davenport)<br />
* geom_errorh correctly responds to height parameter outside of aes<br />
* geom_hline and geom_vline will not impact legend when used for fixed intercepts<br />
* geom_hline/geom_vline: intercept values not set quite correctly which caused a problem in conjunction with transformed scales (reported by Seth Finnegan)<br />
* geom_line: can now stack lines again with position = &#8220;stack&#8221; (fixes #74)<br />
* geom_segment: arrows now preserved in non-Cartesian coordinate system (fixes #117)<br />
* geom_smooth now deals with missing values in the same way as geom_line (thanks to patch from Karsten Loesing)<br />
* guides: check all axis labels for expressions (reported by Benji Oswald)<br />
* guides: extra 0.5 line margin around legend (fixes #71)<br />
* guides: non-left legend positions now work once more (thanks to patch from Karsten Loesing)<br />
* label_bquote works with more expressions (factors now cast to characters, thanks to Baptiste Auguie for bug report)<br />
* scale_color: add missing US spellings<br />
* stat: panels with no non-missing values trigged errors with some statistics. (reported by Giovanni Dall&#8217;Olio)<br />
* stat: statistics now also respect layer parameter inherit.aes (thanks to bug report by Lorenzo Isella and investigation by Brian Diggs)<br />
* stat_bin no longer drops 0-count bins by default<br />
* stat_bin: fix small bug when dealing with single bin with NA position (reported by John Rauser)<br />
* stat_binhex: uses range of data from scales when computing binwidth so hexes are the same size in all facets (thanks to Nicholas Lewin-Koh for the bug report)<br />
* stat_qq has new dparam parameter for specifying distribution parameters (thanks to Yunfeng Zhang for the bug report)<br />
* stat_smooth now uses built-in confidence interval (with small sample correction) for linear models (thanks to suggestion by Ian Fellows)<br />
* sta</p>
]]></content:encoded>
			<wfw:commentRss>http://www.r-statistics.com/2010/07/released-today-new-versions-for-ggplot2-0-8-8-and-plyr-1-0/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>StackOverFlow and MetaOptimize are battling to be the #1 &#8220;Statistical Analysis Q&amp;A website&#8221; &#8211; which one would you sign up for?</title>
		<link>http://www.r-statistics.com/2010/07/statistical-analysis-qa-website-did-stackoverflow-just-lose-it-to-metaoptimize-and-is-it-good-or-bad/</link>
		<comments>http://www.r-statistics.com/2010/07/statistical-analysis-qa-website-did-stackoverflow-just-lose-it-to-metaoptimize-and-is-it-good-or-bad/#comments</comments>
		<pubDate>Fri, 02 Jul 2010 21:55:05 +0000</pubDate>
		<dc:creator>Tal Galili</dc:creator>
				<category><![CDATA[R community]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[area51]]></category>
		<category><![CDATA[artificial intelligence]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[data visualization]]></category>
		<category><![CDATA[information retrieval]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[Q&A website]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[stack exchange]]></category>
		<category><![CDATA[stackoverflow]]></category>
		<category><![CDATA[statistical modeling]]></category>
		<category><![CDATA[text analysis]]></category>

		<guid isPermaLink="false">http://www.r-statistics.com/?p=442</guid>
		<description><![CDATA[A new statistical analysis Q&#38;A website launched While the proposal for a statistical analysis Q&#38;A website on area51 (stackexchange) is taking its time, and the website is still collecting people who will commit to it, Joseph Turian, who seems a nice guy from his various comments online, seems to feel this website is not what the community needs and that we shouldn&#8217;t hold back our questions until that website goes online. Therefore, Joseph is pushing with all his [...]]]></description>
			<content:encoded><![CDATA[<h3>A new statistical analysis Q&amp;A website launched</h3>
<p>While <a href="http://bit.ly/aDuRKV">the proposal for a statistical analysis Q&amp;A website</a> on area51 (stackexchange) is taking its time, and the website is still collecting people who will commit to it,<br />
<a href="http://www-etud.iro.umontreal.ca/~turian/">Joseph Turian</a>, who seems a nice guy from his various comments online, seems to feel this website is not what the community needs and that we shouldn&#8217;t hold back our questions until that website goes online.  Therefore, Joseph is pushing with all his might his newest creation &#8220;<a href="http://metaoptimize.com/qa">MetaOptimize QA</a>&#8221;, a <a href="http://StackOverFlow.com">StackOverFlow</a>-like website for (long list follows): <em>machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization</em>.<br />
It comes with all the bells and whistles that the <a href="http://www.osqa.net/">OSQA framework</a> (an open source StackOverFlow clone, and more) can offer (you know, rankings, badges and so on).</p>
<p>Is this new website better than the area51 website?  Will all the people go to just one of the two websites, or will we end up with two places that attract more people than we had to begin with?  These are the questions that come to mind when faced with the story in front of us.</p>
<p>My own suggestion is to try both websites (<a href="http://bit.ly/aDuRKV">the stackoverflow statistical analysis website to come</a> and &#8220;<a href="http://metaoptimize.com/qa">MetaOptimize QA</a>&#8221;) and let time tell.</p>
<p>More info on this story below.</p>
<h3>MetaOptimize online impact so far</h3>
<p>The need for such a Q&amp;A site is clearly evident.  Just several days after being promoted online, MetaOptimize has drawn almost 300 users, who have submitted 59 questions and 129 answers.<br />
Already many bloggers in the statistical community have contributed their voices with encouraging posts; here is a collection of the posts I was able to find with some googling:</p>
<ul>
<li><a href="http://hunch.net/?p=1425">http://hunch.net/?p=1425</a></li>
<li><a href="http://ebiquity.umbc.edu/blogger/2010/06/30/training-examples-qa-stackoverflow-for-nlp-and-ml/">http://ebiquity.umbc.edu/blogger/2010/06/30/training-examples-qa-stackoverflow-for-nlp-and-ml/</a></li>
<li><a href="http://lingpipe-blog.com/2010/06/29/training-examples-a-stack-overflow-for-nlp-and-ml-and/">http://lingpipe-blog.com/2010/06/29/training-examples-a-stack-overflow-for-nlp-and-ml-and/</a></li>
<li><a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2010/06/question_answer.html">http://www.stat.columbia.edu/~cook/movabletype/archives/2010/06/question_answer.html</a></li>
<li><a href="http://kaggle.com/blog/2010/07/02/new-machine-learning-and-natural-language-processing-qa-site/">http://kaggle.com/blog/2010/07/02/new-machine-learning-and-natural-language-processing-qa-site/</a></li>
<li><a href="http://www.jroller.com/otis/entry/metaoptimize_com_q_a_site">http://www.jroller.com/otis/entry/metaoptimize_com_q_a_site</a></li>
<li><a href="http://sbseminar.wordpress.com/2010/06/17/statistics-version-of-mathoverflow-looking-for-beta-testers/">http://sbseminar.wordpress.com/2010/06/17/statistics-version-of-mathoverflow-looking-for-beta-testers/</a></li>
<li><a href="http://myumbc3.my.umbc.edu/news/1841">http://myumbc3.my.umbc.edu/news/1841</a></li>
</ul>
<h3>But is it good to have two websites?</h3>
<p>But wait, didn&#8217;t we just start pushing forward another <a href="http://www.r-statistics.com/2010/06/a-new-qa-website-for-data-analysis-based-on-stackoverflow-engine-is-waiting-for-you/">statistical Q&amp;A website two weeks ago</a>?  I am talking about the <strong><a href="http://bit.ly/aDuRKV">Stack Exchange Q&amp;A site proposal: Statistical Analysis</a>.</strong></p>
<p>So what should we (the community of statistically minded people) do the next time we have a question?</p>
<p>Should we wait for the new Stack Exchange website to start?  Or should we start using MetaOptimize?</p>
<p><strong>Update: <span style="font-weight: normal;">after a lengthy e-mail exchange with Joseph (the person who founded MetaOptimize), I decided to erase what I originally wrote as my doubts, and instead give a Q&amp;A session that he and I had in that e-mail exchange.  It is slightly edited from the original, and some of the content will probably get updated &#8211; so if you are into this subject, check in again in a few hours <img src='http://www.r-statistics.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </span></strong></p>
<p><del datetime="2010-07-03T09:28:16+00:00"><br />
Honestly, I am split in two (and <a href="http://www-etud.iro.umontreal.ca/~turian/">Joseph</a>, I do hope you&#8217;ll take this in a positive way, since personally I feel confident you are a good guy).  I very strongly believe in the need and value of such a Q&amp;A website.  Yet I am wondering how I feel about such a website being hosted as MetaOptimize and outside the hands of the stackoverflow guys.<br />
On the one hand, open source lovers (like myself) tend to like decentralization and reliance on OSS (open source software) solutions (such as the one the <a href="http://www.osqa.net/">OSQA framework</a> offers).  On the other hand, I do believe that the stackoverflow people have (much) more experience in handling such websites than Joseph.  I can very easily trust them to do regular database backups, share the website&#8217;s database dumps with the general community, smoothly test and upgrade to provide new features, and generally speaking perform in a more experienced way with the online Q&amp;A community.<br />
It doesn&#8217;t mean that Joseph won&#8217;t do a great job, personally I hope he will.</del></p>
<h3><strong><span style="text-decoration: underline;">Q&amp;A session with Joseph Turian (MetaOptimize founder)</span></strong></h3>
<p><strong><span style="text-decoration: underline;">Tal</span></strong>: Let&#8217;s start with the easy question, should I worry about technical issues in the website (like, for example, backups)?</p>
<p><span style="text-decoration: underline;"><strong>Joseph</strong></span>:</p>
<div id="_mcePaste">The OSQA team (backed by DZone) have got my back. They have been very helpful since day one to all OSQA users, and have given me a lot of support. Thanks, especially Rick and Hernani!</div>
<p>They provide email and chat support for OSQA users.</p>
<p>I will commit to putting up regular automatic database dumps, whenever the OSQA team implements it:<br />
<a href="http://meta.osqa.net/questions/3120/how-do-i-offer-database-dumps">http://meta.osqa.net/questions/3120/how-do-i-offer-database-dumps</a><br />
If, in six months, they don&#8217;t have this feature as part of their core, and someone (e.g. you) emails me reminding me that they want a dump, I will manually do a database dump and strip the user table.</p>
<p>Also, I&#8217;ve got a scheduled daily database dump that is mirrored to Amazon S3.</p>
<p><span style="text-decoration: underline;"><strong><strong><span style="text-decoration: underline;">Tal</span></strong>:</strong></span> Why did you start MetaOptimize instead of supporting the area51 proposal?<br />
<span style="text-decoration: underline;"><strong>Joseph</strong></span>:</p>
<ol>
<li><span style="font-size: 13.1944px;">On Area51, people asked to have AI merged with ML, and ML merged with statistical analysis, but their requests seemed to be ignored. This seemed like a huge disservice to these communities.</span></li>
<li><span style="font-size: 13.1944px;">Area 51 didn&#8217;t have academics in ML + NLP. I know from experience it&#8217;s hard to get them to buy in to new technology. So why would I risk my reputation getting them to sign up for Area 51, when I know that I will get a 1% conversion? They aren&#8217;t early adopters interested in the process, many are late adopters who won&#8217;t sign up for something until they have too.</span></li>
<li><span style="font-size: 13.1944px;">If the Area 51 sites had a strong newbie bent, which is what it seemed like the direction was going, then the academic experts definitely wouldn&#8217;t waste their time. It would become a support community for newbies, without core expert discussion.  So basically, I know that I and a lot of my colleagues wanted the site I built. And I felt like area 51 was shaping the communities really incorrectly in several respects, and was also taking a while.  I could have fought an institutional process and maybe gotten half the results above after a few months, or I could just build the site and invite my friends, and shape the community correctly.</span></li>
</ol>
<p>Besides that, there are also personal motives:</p>
<ul>
<li><span style="font-size: 13.1944px;">I wanted the recognition for having a good vision for the community, and driving forward something they really like.</span></li>
<li><span style="font-size: 13.1944px;">I wanted to experiment with some NLP and ML extensions for the Q+A software, to help organize the information better. Not possible on a closed platform.</span></li>
</ul>
<p><span style="text-decoration: underline;"><strong><strong><span style="text-decoration: underline;">Tal</span></strong>:</strong></span> I (and maybe some other people) fear that this might fork the people in the field into two websites, instead of bringing them together.  What are your thoughts about that?<br />
<span style="text-decoration: underline;"><strong>Joseph</strong></span>:<br />
How am I forking the community? I&#8217;m bringing a bunch of people in who wouldn&#8217;t have even been part of the Area 51 community.<br />
Area 51 was going to fork it into five communities: stat analysis, ML, NLP, AI, and data mining.  And then a lot fewer people would have been involved.</p>
<p><span style="text-decoration: underline;"><strong><strong><span style="text-decoration: underline;">Tal</span></strong>:</strong></span> What are the things that people who support your website are saying?<br />
<span style="text-decoration: underline;"><strong>Joseph</strong></span>:<br />
Here are some quotes about my site:</p>
<blockquote><p>Philip Resnick (UMD): &#8220;Looking at the questions being asked, the people responding, and the quality of the discussion, I can already see this becoming the go-to place for those &#8216;under the hood&#8217; details<br />
you rarely see in the textbooks or conference papers. This site is going to save a lot of people an awful lot of time and frustration.&#8221;</p>
<p>Aria Haghighi (Berkeley): &#8220;Both NLP and ML have a lot of folk wisdom about what works and what doesn&#8217;t. A site like this is crucial for facilitating the sharing and validation of this collective knowledge.&#8221;</p>
<p>Alexandre Passos (Unicamp): &#8220;Really thank you for that. As a machine learning phd student from somewhere far from most good research centers (I&#8217;m in brazil, and how many brazillian ML papers have you<br />
seen in NIPS/ICML recently?), I struggle a lot with this folk wisdom. Most professors around here haven&#8217;t really interacted enough with the international ML community to be up to date&#8221;<br />
(http://news.ycombinator.com/item?id=1476247)</p>
<p>Ryan McDonald (Google): &#8220;A tool like this will help disseminate and archive the tricks and best practices that are common in NLP/ML, but are rarely written about at length in papers.&#8221;</p>
<p>esoom on Reddit: &#8220;This is awesome. I&#8217;m really impressed by the quality of some of the answers, too. Within five minutes of skimming the site, I learned a neat trick that isn&#8217;t widely discussed in the literature.&#8221;<br />
(http://www.reddit.com/r/MachineLearning/comments/ckw5k/stackoverflow_for_machine_learning_and_natural/c0tb3gc)</p>
<p><span style="text-decoration: underline;"><strong><strong><span style="text-decoration: underline;">Tal</span></strong>:</strong></span> To be fair to the area51 work, they have gotten wonderful responses for the &#8220;statistical analysis&#8221; proposal as well (<a href="http://bit.ly/aDuRKV">see it here</a>).<br />
I have also contacted area51 directly and invited them to come and join the discussion.  I&#8217;ll update this post with their reply.</p></blockquote>
<h3><span style="text-decoration: underline;">So what&#8217;s next?</span></h3>
<p><del datetime="2010-07-03T08:08:02+00:00">I don&#8217;t know.<br />
If the Stack Exchange website were to launch today, I would probably focus on using it and hint to the site for MetaOptimize (for the reasons I just mentioned, and also for some that Rob Hyndman maintained when he <a href="http://robjhyndman.com/researchtips/stack-exchange-for-statistical-analysis-needs-you/">first wrote on the subject</a>).<br />
If the stack exchange version of the website were to start in a few weeks, I would probably sit on the fence and see if people are using it.  I suspect that by that time, there wouldn&#8217;t be many people left to populate it (but I could always be wrong).<br />
And what if the website were to start in a week, what then?  I have no clue.</del><br />
Good question.<br />
My current feeling is that I am glad to let this play out.<br />
It seems this is a good case study for some healthy competition between platforms and models (OSQA vs stackoverflow/area51-system) &#8211; one that I hope will generate more good features from both companies, and that will also make both parties work hard to get people to participate.<br />
It also seems that this situation is exposing many people in our field to the same idea (a Q&amp;A website).  After Joseph&#8217;s input on the subject, I am starting to think that maybe at the end of the day this will benefit all of us.  Instead of forking one community into two, maybe what we&#8217;ll end up with is more (experienced) people online (in two locations) who would otherwise have stayed in the shadows.</p>
<p>The verdict is still out, but I am a bit more optimistic than I was when first writing this post.  I&#8217;ll update this post after getting more input from people.</p>
<p>And as always &#8211; I would love to know <strong><span style="text-decoration: underline;">your thoughts</span></strong> on the subject.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.r-statistics.com/2010/07/statistical-analysis-qa-website-did-stackoverflow-just-lose-it-to-metaoptimize-and-is-it-good-or-bad/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Visualization of regression coefficients (in R)</title>
		<link>http://www.r-statistics.com/2010/07/visualization-of-regression-coefficients-in-r/</link>
		<comments>http://www.r-statistics.com/2010/07/visualization-of-regression-coefficients-in-r/#comments</comments>
		<pubDate>Fri, 02 Jul 2010 19:46:56 +0000</pubDate>
		<dc:creator>Tal Galili</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[visualization]]></category>
		<category><![CDATA[coefficients]]></category>
		<category><![CDATA[Coefficients Visualization]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[plot]]></category>
		<category><![CDATA[regression]]></category>
		<category><![CDATA[regression plot]]></category>
		<category><![CDATA[regression Visualization]]></category>

		<guid isPermaLink="false">http://www.r-statistics.com/?p=435</guid>
		<description><![CDATA[Update (07.07.10): The function in this post has a more mature version in the &#8220;arm&#8221; package. See at the end of this post for more details. * * * * Imagine you want to give a presentation or report of your latest findings running some sort of regression analysis. How would you do it? This was exactly the question Wincent Rong-gui HUANG has recently asked on the R mailing list. One person, Bernd Weiss, responded by linking to the chapter [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Update (07.07.10)</strong>: The function in this post has a more mature version in the &#8220;arm&#8221; package.  See at the end of this post for more details.<br />
* * * *</p>
<p>Imagine you want to give a presentation or report of your latest findings running some sort of regression analysis.  How would you do it?</p>
<p>This was exactly the question Wincent Rong-gui HUANG has recently asked <a href="http://r.789695.n4.nabble.com/Visualization-of-coefficients-tt2276010.html#none">on the R mailing list</a>.</p>
<p>One person, Bernd Weiss, responded by linking to the chapter &#8220;<a href="http://tables2graphs.com/doku.php?id=04_regression_coefficients">Plotting Regression Coefficients</a>&#8221; in an interesting online book (which I had never heard of before) called &#8220;<a href="http://tables2graphs.com/doku.php">Using Graphs Instead of Tables</a>&#8221; (I should add this link to the <a href="http://www.r-statistics.com/2009/10/free-statistics-e-books-for-download/">free statistics e-books list</a>&#8230;)</p>
<p>Later in the conversation, <a href="http://statmath.wu.ac.at/~zeileis/">Achim Zeileis</a> surprised us (well, me) by saying the following:</p>
<blockquote><p>I&#8217;ve thought about adding a plot() method for the coeftest() function in the <a href="http://cran.r-project.org/web/packages/lmtest/index.html">&#8220;lmtest&#8221; package</a>. Essentially, it relies on a coef() and a vcov() method being available &#8211; <strong>and that a central limit theorem holds</strong>. For releasing it as a general function in the package the code is still too raw, but maybe it&#8217;s useful for someone on the list. Hence,<strong> I&#8217;ve included it below</strong>.</p></blockquote>
<p> (I allowed myself to add some <strong>bold</strong> to the text)</p>
<p>So for the convenience of all of us, I uploaded Achim&#8217;s code in a file for easy access.  Here is an example of how to use it:</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;"><span style="color: #0000FF; font-weight: bold;">source</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">&quot;http://www.r-statistics.com/wp-content/uploads/2010/07/coefplot.r.txt&quot;</span><span style="color: #080;">&#41;</span>
&nbsp;
<span style="color: #0000FF; font-weight: bold;">data</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">&quot;Mroz&quot;</span>, package <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;car&quot;</span><span style="color: #080;">&#41;</span>
fm <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">glm</span><span style="color: #080;">&#40;</span>lfp ~ ., <span style="color: #0000FF; font-weight: bold;">data</span> <span style="color: #080;">=</span> Mroz, <span style="color: #0000FF; font-weight: bold;">family</span> <span style="color: #080;">=</span> <span style="color: #0000FF; font-weight: bold;">binomial</span><span style="color: #080;">&#41;</span>
coefplot<span style="color: #080;">&#40;</span>fm, parm <span style="color: #080;">=</span> <span style="color: #080;">-</span><span style="color: #ff0000;">1</span><span style="color: #080;">&#41;</span></pre></div></div>

<p>Here is the resulting graph:<br />
<a href="http://www.r-statistics.com/wp-content/uploads/2010/07/regression-coefficient-plot.png"><img src="http://www.r-statistics.com/wp-content/uploads/2010/07/regression-coefficient-plot.png" alt="" title="regression coefficient plot" width="550" class="alignright size-full wp-image-437" /></a></p>
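<p>To make Achim&#8217;s remark about coef() and vcov() a bit more concrete, here is a minimal sketch of what a plot like this boils down to. It is not Achim&#8217;s code: the helper names coef_intervals() and plot_coef_intervals() are made up for illustration, the intervals are plain normal-approximation (Wald) intervals, which is where the central limit theorem assumption comes in, and the demo uses the built-in mtcars data rather than Mroz so that it runs without extra packages.</p>

```r
# Hypothetical helper: Wald-type intervals built only from coef() and vcov().
coef_intervals <- function(model, level = 0.95, drop_intercept = TRUE) {
  est <- coef(model)                     # point estimates
  se  <- sqrt(diag(vcov(model)))         # standard errors from the covariance matrix
  z   <- qnorm(1 - (1 - level) / 2)      # normal quantile (relies on a CLT holding)
  out <- data.frame(term     = names(est),
                    estimate = unname(est),
                    lower    = unname(est - z * se),
                    upper    = unname(est + z * se),
                    stringsAsFactors = FALSE)
  if (drop_intercept) out <- out[out$term != "(Intercept)", , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Hypothetical helper: one dot per coefficient with a horizontal interval bar.
plot_coef_intervals <- function(ci) {
  y <- rev(seq_len(nrow(ci)))            # first coefficient is drawn at the top
  plot(ci$estimate, y, xlim = range(ci$lower, ci$upper), pch = 19,
       yaxt = "n", xlab = "Estimate", ylab = "")
  segments(ci$lower, y, ci$upper, y)     # the confidence intervals
  abline(v = 0, lty = 2)                 # reference line at zero
  axis(2, at = y, labels = ci$term, las = 1)
  invisible(ci)
}

# Demo on a built-in data set (an assumption of this sketch, not the post's data).
fm_demo <- glm(am ~ wt + hp, data = mtcars, family = binomial)
ci_demo <- coef_intervals(fm_demo)
plot_coef_intervals(ci_demo)
```

<p>Any model object that provides coef() and vcov() methods should work with a sketch like this, which is the same requirement Achim describes for his function.</p>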
<p>I hope Achim will get around to improving the function so that he might think it worthy of joining his <a href="http://cran.r-project.org/web/packages/lmtest/index.html">&#8220;lmtest&#8221; package</a>.  I am glad he shared his code for the rest of us to have something to work with in the meantime <img src='http://www.r-statistics.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>* * *</p>
<p><strong>Update (07.07.10)</strong>:<br />
Thanks to a comment by David Atkins, I found out there is a more mature version of this function (called <strong>coefplot</strong>) inside the {arm} package.  This version offers many features, one of which is the ability to easily stack several confidence intervals one on top of the other.</p>
<p>It works for bayesglm, glm, lm, and polr objects, and a default method is available which takes pre-computed coefficients and associated standard errors from any suitable model.</p>
<p><strong>Example:</strong><br />
(Notice that the Poisson model in comparison with the binomial models does not make much sense, but is enough to illustrate the use of the function)</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;"><span style="color: #0000FF; font-weight: bold;">library</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">&quot;arm&quot;</span><span style="color: #080;">&#41;</span>
<span style="color: #0000FF; font-weight: bold;">data</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">&quot;Mroz&quot;</span>, package <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;car&quot;</span><span style="color: #080;">&#41;</span>
M1<span style="color: #080;">&lt;-</span>      <span style="color: #0000FF; font-weight: bold;">glm</span><span style="color: #080;">&#40;</span>lfp ~ ., <span style="color: #0000FF; font-weight: bold;">data</span> <span style="color: #080;">=</span> Mroz, <span style="color: #0000FF; font-weight: bold;">family</span> <span style="color: #080;">=</span> <span style="color: #0000FF; font-weight: bold;">binomial</span><span style="color: #080;">&#41;</span>
M2<span style="color: #080;">&lt;-</span> bayesglm<span style="color: #080;">&#40;</span>lfp ~ ., <span style="color: #0000FF; font-weight: bold;">data</span> <span style="color: #080;">=</span> Mroz, <span style="color: #0000FF; font-weight: bold;">family</span> <span style="color: #080;">=</span> <span style="color: #0000FF; font-weight: bold;">binomial</span><span style="color: #080;">&#41;</span>
M3<span style="color: #080;">&lt;-</span>      <span style="color: #0000FF; font-weight: bold;">glm</span><span style="color: #080;">&#40;</span>lfp ~ ., <span style="color: #0000FF; font-weight: bold;">data</span> <span style="color: #080;">=</span> Mroz, <span style="color: #0000FF; font-weight: bold;">family</span> <span style="color: #080;">=</span> <span style="color: #0000FF; font-weight: bold;">binomial</span><span style="color: #080;">&#40;</span>probit<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
coefplot<span style="color: #080;">&#40;</span>M2, xlim<span style="color: #080;">=</span><span style="color: #0000FF; font-weight: bold;">c</span><span style="color: #080;">&#40;</span><span style="color: #080;">-</span><span style="color: #ff0000;">2</span>, <span style="color: #ff0000;">6</span><span style="color: #080;">&#41;</span>,            intercept<span style="color: #080;">=</span>TRUE<span style="color: #080;">&#41;</span>
coefplot<span style="color: #080;">&#40;</span>M1, add<span style="color: #080;">=</span>TRUE, col.<span style="">pts</span><span style="color: #080;">=</span><span style="color: #ff0000;">&quot;red&quot;</span>,  intercept<span style="color: #080;">=</span>TRUE<span style="color: #080;">&#41;</span>
coefplot<span style="color: #080;">&#40;</span>M3, add<span style="color: #080;">=</span>TRUE, col.<span style="">pts</span><span style="color: #080;">=</span><span style="color: #ff0000;">&quot;blue&quot;</span>, intercept<span style="color: #080;">=</span>TRUE, <span style="color: #0000FF; font-weight: bold;">offset</span><span style="color: #080;">=</span><span style="color: #ff0000;">0.2</span><span style="color: #080;">&#41;</span></pre></div></div>

<p>(A hat tip goes to Allan Engelhardt for help improving the code, and to Achim Zeileis for extending and improving the narration for the example.)</p>
<p><strong>Resulting plot </strong></p>
<p><a href="http://www.r-statistics.com/wp-content/uploads/2010/07/coeff-visualization-3.png"><img src="http://www.r-statistics.com/wp-content/uploads/2010/07/coeff-visualization-3.png" alt="" title="coeff visualization 3" width="550" class="alignright size-full wp-image-471" /></a></p>
<p>* * *<br />
Lastly, another method worth mentioning is the nomogram, implemented in Frank Harrell&#8217;s <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/Rrms">rms package</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.r-statistics.com/2010/07/visualization-of-regression-coefficients-in-r/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Contest: Road Traffic Prediction for Intelligent GPS Navigation</title>
		<link>http://www.r-statistics.com/2010/06/contest-road-traffic-prediction-for-intelligent-gps-navigation/</link>
		<comments>http://www.r-statistics.com/2010/06/contest-road-traffic-prediction-for-intelligent-gps-navigation/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 17:25:38 +0000</pubDate>
		<dc:creator>Tal Galili</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[competition]]></category>
		<category><![CDATA[contest]]></category>
		<category><![CDATA[GPS]]></category>
		<category><![CDATA[prize]]></category>
		<category><![CDATA[prizes]]></category>
		<category><![CDATA[Road Traffic prediction]]></category>

		<guid isPermaLink="false">http://www.r-statistics.com/?p=424</guid>
		<description><![CDATA[About prize-bearing contests Competitions with prizes are an amazing thing. If you are not sure of that, I urge you to listen to Peter Diamandis talk about his experience with the X prize (start listening at minute 11:40): In short &#8211; prizes can give up to a 1 to 50 ratio of return on investment for the people funding the prize. The money is spent only when results are achieved. And there is a lot of value in [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.r-statistics.com/wp-content/uploads/2010/06/ICDM.jpg"><img src="http://www.r-statistics.com/wp-content/uploads/2010/06/ICDM.jpg" alt="" title="Red Sports Car" width="347" height="346" class="alignnone size-full wp-image-429" /></a></p>
<h3>About prize-bearing contests</h3>
<p>Competitions with prizes are an amazing thing.  If you are not sure of that, I urge you to listen to Peter Diamandis talk about his experience with the X prize (<strong>start listening at minute 11:40</strong>):</p>
<p><object width="480" height="385"><param name="movie" value="http://www.youtube.com/v/sUOBLX55h4s&#038;hl=en_US&#038;fs=1&#038;start=700"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/sUOBLX55h4s&#038;hl=en_US&#038;fs=1&#038;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"></embed></object></p>
<p>In short &#8211; prizes can give up to a 1 to 50 ratio of return on investment for the people funding the prize.  The money is spent only when results are achieved.  And there is a lot of value in terms of public opinion and publicity.  And best of all (for the promoter of the competition) &#8211; prizes encourage people to take risks (at their own expense) in order to get results.</p>
<p>All of that said, I look at prize-bearing competitions as something worth spreading, especially in cases where the results of the winning team will be shared with the public.</p>
<h3>About the IEEE ICDM Contest</h3>
<p>The IEEE ICDM Contest (&#8220;Road Traffic Prediction for Intelligent GPS Navigation&#8221;) seems to be one of those cases.  Following a polite request, I am republishing here the details of this new competition, in the hope that some of my R colleagues will bring the community some pride <img src='http://www.r-statistics.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /><br />
<span id="more-424"></span></p>
<h3>Introduction</h3>
<p>Data mining competition affiliated with IEEE International Conference on Data Mining 2010 (ICDM), Sydney, Australia, Dec 14-17. The task is to predict city traffic based on simulated historical measurements or real-time stream of notifications sent by individual drivers from their GPS navigators. Prizes worth $5,000 will be awarded to the winners.</p>
<p><img src="http://tunedit.org/download/ICDM/2010/varia/TSF.png" alt="Traffic map" /></p>
<h3>Road Traffic Prediction for Intelligent GPS Navigation</h3>
<p>Over the last century, the number of cars engaged in vehicular traffic in cities has increased rapidly, causing many difficulties for all citizens: traffic jams, large and unpredictable communication delays, pollution etc. Excessive traffic has become a civilization problem that affects everyone who lives in a city of 50,000 or larger, anywhere in the world. The complexity of the processes that stand behind traffic flow is so large that only data mining algorithms &#8211; from the domains of structure mining, graph mining, data streams, large-scale and temporal data mining &#8211; may bring efficient solutions for these problems. With the proposed competition, we want to ask researchers to devise the best possible algorithms that tackle problems of traffic flow prediction, for the purpose of intelligent driver navigation and improved city planning.<br />
There are three independent tasks:</p>
<ul>
<li>Traffic (<a href="http://tunedit.org/challenge/IEEE-ICDM-2010/traffic">link</a>). Traffic congestion prediction, in an elementary setup of time series forecasting: a series of measurements from 10 selected road segments is given and the goal is to make short-term predictions of future values based on historical ones. This task is intended as an introductory one, simpler than the other two.
</li>
<li>Jams (<a href="http://tunedit.org/challenge/IEEE-ICDM-2010/jams">link</a>). Modeling the process of traffic jam formation during the morning peak in the presence of roadworks, based on initial information about jams broadcast by radio stations. Input data contain identifiers of road segments closed due to roadworks, accompanied by a sequence of segments where the first jams occurred. The algorithm should predict a sequence of segments where the next jams will occur in the near future. </li>
<li>GPS (<a href="http://tunedit.org/challenge/IEEE-ICDM-2010/gps">link</a>). Traffic reconstruction and prediction based on real-time information from individual drivers. Input data consist of a stream of notifications from 1% of vehicles about their current GPS locations in the city road network, sent every 10 seconds. The algorithm receives this stream and predicts traffic congestion on selected road segments for the next 30 minutes. Large volumes of data are involved in this task, requiring the use of scalable data mining methods. </li>
</ul>
<p>Everyone is welcome to participate. Competition starts now and will last till September 6th, 2010. More details on:  <a href="http://tunedit.org/challenge/IEEE-ICDM-2010">http://tunedit.org/challenge/IEEE-ICDM-2010</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.r-statistics.com/2010/06/contest-road-traffic-prediction-for-intelligent-gps-navigation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A new Q&amp;A website for Data-Analysis (based on StackOverFlow engine) &#8211; is waiting for you</title>
		<link>http://www.r-statistics.com/2010/06/a-new-qa-website-for-data-analysis-based-on-stackoverflow-engine-is-waiting-for-you/</link>
		<comments>http://www.r-statistics.com/2010/06/a-new-qa-website-for-data-analysis-based-on-stackoverflow-engine-is-waiting-for-you/#comments</comments>
		<pubDate>Thu, 17 Jun 2010 13:29:55 +0000</pubDate>
		<dc:creator>Tal Galili</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[R community]]></category>
		<category><![CDATA[Q&A]]></category>
		<category><![CDATA[R comunity]]></category>
		<category><![CDATA[R Q&A]]></category>
		<category><![CDATA[stackoverflow]]></category>

		<guid isPermaLink="false">http://www.r-statistics.com/?p=415</guid>
		<description><![CDATA[The bottom line of this post is for you to go to: Stack Exchange Q&#38;A site proposal: Statistical Analysis And commit yourself to using the website for asking and answering questions. 144 people have already committed to using the website, we need 356 more&#8230; If you are looking for the reasons to do so &#8211; read on&#8230; What is the StackOverFlow Q&#38;A website about? StackOverFlow.com (&#8220;SO&#8221; for short) is a programming Q &#38; A site that&#8217;s free. Free to ask questions, [...]]]></description>
			<content:encoded><![CDATA[<p><strong>The bottom line of this post is for you to go to:<br />
<a href="http://bit.ly/aDuRKV">Stack Exchange Q&amp;A site proposal: Statistical Analysis </a><br />
And commit yourself to using the website for asking and answering questions. </strong>144 people have already committed to using the website, we need 356 more&#8230; <img src='http://www.r-statistics.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /><br />
If you are looking for the reasons to do so &#8211; read on&#8230;</p>
<h3>What is the StackOverFlow Q&amp;A website about?</h3>
<p><a href="http://StackOverFlow.com">StackOverFlow.com</a> (&#8220;SO&#8221; for short) is a programming Q &amp; A site that&#8217;s free. Free to ask questions, free to answer questions, free to read. Free. And fast.</p>
<p>For the R community, SO offers a growing database of <a href="http://stackoverflow.com/questions/tagged/R">R-related questions and answers</a> (click the link to check them out).</p>
<p>You might be asking yourself what&#8217;s so special about SO compared with other available resources such as <a href="http://www.r-project.org/mail.html">R mailing lists</a>, <a href="http://www.r-bloggers.com/">R blogs</a>, <a href="http://rwiki.sciviews.org/doku.php">R wiki</a> and so on?<br />
That is a great question.<br />
<a href="http://www.r-statistics.com/wp-content/uploads/2010/06/venn-diagram.png"><img class="alignnone size-full wp-image-416" title="venn-diagram" src="http://www.r-statistics.com/wp-content/uploads/2010/06/venn-diagram.png" alt="" width="440" height="431" /></a><br />
The answer is that SO succeeds in doing a great job synthesizing aspects of Wikis, Blogs, Forums, and Digg/Reddit to offer a very powerful Q&amp;A website.</p>
<p>In SO, new questions are like forum/blog posts (a main text with comments/answers).  After someone answers a question, other users can give the answer a thumbs-up or a thumbs-down (like digg/reddit).  And all content can be edited, like a wiki page, by the users (provided the user has enough &#8220;karma points&#8221;).<br />
You also get badges (&#8220;awards&#8221;) for a bunch of actions (like coming to the website every day for a month, giving an answer that got X thumbs-ups, and so on).  The awards allow someone who is asking a question to see how good the reputation of the person answering is (in terms of acceptance/appreciation of that person&#8217;s answers by other SO members).<br />
They also offer a small (but effective) ego-boost for the person who gives answers.</p>
<h3>So if StackOverFlow is so great &#8211; what is this new website you wrote about in the title?</h3>
<p>Well, StackOverFlow has one limitation.  It deals ONLY with programming questions.  Other questions like:</p>
<ul>
<li>Which of the following three graphics best displays this data set? Why?</li>
<li>Can you give an example of where I might prefer to use a z-test vs a t-test?</li>
<li>What is the relationship between Bayesian and neural networks?</li>
</ul>
<p>will not be answered, and the threads will get closed as being &#8220;off topic&#8221;.  Why? Because such questions deal with statistics, data analysis, data mining, and data visualization &#8211; but by no means with programming.</p>
<p>So there is no StackOverFlow-like Q&amp;A website for data analysis&#8230; Until now!</p>
<p>In the past few weeks, <a href="http://area51.stackexchange.com/users/14/rob-hyndman">Rob Hyndman</a> and other users have put a lot of effort into pushing for the creation of a new website, based on the StackOverFlow engine, to allow for statistics-related Q&amp;A.<br />
His proposal for a new website is almost complete.  All it needs is for you (yes, you) to go to the following link:<br />
<a href="http://bit.ly/aDuRKV">Stack Exchange Q&amp;A site proposal: Statistical Analysis </a><br />
And commit yourself to the website (that is, click the button called &#8220;commit&#8221; &#8211; to declare that you are interested in reading, asking and answering questions on such a website).</p>
<p>Once <del datetime="2010-06-18T04:54:51+00:00">a few more tens</del> 379 more people commit &#8211; the website will go online!</p>
<p>Hope to see you there.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.r-statistics.com/2010/06/a-new-qa-website-for-data-analysis-based-on-stackoverflow-engine-is-waiting-for-you/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>Clustergram: visualization and diagnostics for cluster analysis (R code)</title>
		<link>http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/</link>
		<comments>http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/#comments</comments>
		<pubDate>Tue, 15 Jun 2010 08:22:34 +0000</pubDate>
		<dc:creator>Tal Galili</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[visualization]]></category>
		<category><![CDATA[base graphics]]></category>
		<category><![CDATA[cluster analysis]]></category>
		<category><![CDATA[clustergram]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Dendrogram]]></category>
		<category><![CDATA[diagnose]]></category>
		<category><![CDATA[diagnosing]]></category>
		<category><![CDATA[diagnostics]]></category>
		<category><![CDATA[functions]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[hierarchical clustering]]></category>
		<category><![CDATA[iris]]></category>
		<category><![CDATA[iris data set]]></category>
		<category><![CDATA[large data]]></category>
		<category><![CDATA[matlines]]></category>
		<category><![CDATA[non-hierarchical]]></category>
		<category><![CDATA[parallel coordinates]]></category>
		<category><![CDATA[R code]]></category>
		<category><![CDATA[R functions]]></category>
		<category><![CDATA[tree]]></category>

		<guid isPermaLink="false">http://www.r-statistics.com/?p=391</guid>
		<description><![CDATA[About Clustergrams In 2002, Matthias Schonlau published in &#8220;The Stata Journal&#8221; an article named &#8220;The Clustergram: A graph for visualizing hierarchical and non-hierarchical cluster analyses&#8221;. As explained in the abstract: In hierarchical cluster analysis dendrogram graphs are used to visualize how clusters are formed. I propose an alternative graph named “clustergram” to examine how cluster members are assigned to clusters as the number of clusters increases. This graph is useful in exploratory analysis for non-hierarchical clustering algorithms like k-means and for hierarchical [...]]]></description>
			<content:encoded><![CDATA[<h3>About Clustergrams</h3>
<p>In 2002, <a href="http://www.schonlau.net/clustergram.html">Matthias Schonlau</a> published in &#8220;The Stata Journal&#8221; an article named &#8220;<a href="https://docs.google.com/viewer?url=http://www.schonlau.net/publication/02stata_clustergram.pdf">The Clustergram: A graph for visualizing hierarchical and non-hierarchical cluster analyses</a>&#8221;.  As explained in the abstract:</p>
<blockquote><p>In hierarchical cluster analysis dendrogram graphs are used to visualize how clusters are formed. I propose an alternative graph named “clustergram” to examine how cluster members are assigned to clusters as the number of clusters increases.<br />
This graph is useful in exploratory analysis for non-hierarchical clustering algorithms like k-means and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical.</p></blockquote>
<p>A <a href="https://docs.google.com/viewer?url=http://www.schonlau.net/publication/04compstat_clustergram.pdf">similar article</a> was later written and was (maybe) published in &#8220;Computational Statistics&#8221;.</p>
<p>Both articles give some nice background on known methods like k-means and methods for hierarchical clustering, and then go on to present examples of using these methods (with the clustergram) to analyse some datasets.</p>
<p>Personally, I understand the clustergram to be a type of parallel coordinates plot where each observation is given a vector.  The vector contains the observation&#8217;s location according to how many clusters the dataset was split into.  The scale of the vector is the scale of the first principal component of the data. </p>
<h3>Clustergram in R (a basic function)</h3>
<p>After finding out about this method of visualization, I was haunted by the curiosity to play with it a bit.  Therefore, and since I didn&#8217;t find any implementation of the graph in R, I went about writing the code to implement it.</p>
<p>The code only works for kmeans, but it shows how such a plot can be produced, and could later be modified to offer methods that connect with different clustering algorithms.</p>
<p>The function I present here gets a data.frame/matrix with a row for each observation and a column for each variable (dimension).<br />
The function assumes the data is scaled.<br />
The function then goes about calculating the cluster centers for our data, for a varying number of clusters.<br />
For each number of clusters, the cluster centers are multiplied by the loadings of the first principal component of the original data, thus offering a weighted mean of each cluster center&#8217;s dimensions that might give a decent representation of that cluster (this method has the known limitations of using the first component of a PCA for dimensionality reduction, but I won&#8217;t go into that in this post).<br />
Finally, all of our data points are ordered according to their respective cluster&#8217;s first-component value, and plotted against the number of clusters (thus creating the clustergram).</p>
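<p>To make these steps concrete, here is a minimal sketch of a single iteration (one value of k) on the scaled iris measurements.  The names loading.1, center.scores and y are only illustrative, and k = 3 is an arbitrary choice; the full function below repeats this for every k in a range.</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;">Data &lt;- scale(iris[, -5])                   # scaled data, one row per observation
loading.1 &lt;- princomp(Data)$loadings[, 1]   # loadings of the first principal component
cl &lt;- kmeans(Data, 3)                       # k-means with k = 3 (arbitrary choice)
center.scores &lt;- cl$centers %*% loading.1   # 1 number per cluster center
y &lt;- center.scores[cl$cluster]              # each observation gets the value of its cluster
head(data.frame(cluster = cl$cluster, y = y))</pre></div></div>
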
<p>My thanks go to <a href="http://had.co.nz/">Hadley Wickham</a> for offering some good tips on how to prepare the graph.</p>
<p>Here is the code (example follows)</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;">&nbsp;
&nbsp;
clustergram.<span style="">kmeans</span> <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">function</span><span style="color: #080;">&#40;</span>Data, k, ...<span style="color: #080;">&#41;</span>
<span style="color: #080;">&#123;</span>
	<span style="color: #228B22;"># this is the type of function that the clustergram</span>
	<span style="color: #228B22;"># 	function takes for the clustering.</span>
	<span style="color: #228B22;"># 	using similar structure will allow implementation of different clustering algorithms</span>
&nbsp;
	<span style="color: #228B22;">#	It returns a list with two elements:</span>
	<span style="color: #228B22;">#	cluster = a vector of length n (the number of subjects/items)</span>
	<span style="color: #228B22;">#				indicating to which cluster each item belongs.</span>
	<span style="color: #228B22;">#	centers = a k dimensional vector.  Each element is 1 number that represents that cluster</span>
	<span style="color: #228B22;">#				In our case, we are using the weighted mean of the cluster dimensions, </span>
	<span style="color: #228B22;">#				weighted by the first component (loading) of the PCA of the Data.</span>
&nbsp;
	cl <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">kmeans</span><span style="color: #080;">&#40;</span>Data, k,...<span style="color: #080;">&#41;</span>
&nbsp;
	cluster <span style="color: #080;">&lt;-</span> cl$cluster
	centers <span style="color: #080;">&lt;-</span> cl$centers <span style="color: #080;">%*%</span> <span style="color: #0000FF; font-weight: bold;">princomp</span><span style="color: #080;">&#40;</span>Data<span style="color: #080;">&#41;</span>$loadings<span style="color: #080;">&#91;</span>,<span style="color: #ff0000;">1</span><span style="color: #080;">&#93;</span>	<span style="color: #228B22;"># 1 number per center</span>
												<span style="color: #228B22;"># here we are using the weighted mean for each cluster center</span>
&nbsp;
	<span style="color: #0000FF; font-weight: bold;">return</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">list</span><span style="color: #080;">&#40;</span>
				cluster <span style="color: #080;">=</span> cluster,
				centers <span style="color: #080;">=</span> centers
			<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
<span style="color: #080;">&#125;</span>		
&nbsp;
clustergram.<span style="">plot</span>.<span style="">matlines</span> <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">function</span><span style="color: #080;">&#40;</span>X,Y, k.<span style="">range</span>, 
											x.<span style="">range</span>, y.<span style="">range</span> , COL, 
											add.<span style="">center</span>.<span style="">points</span> , centers.<span style="">points</span><span style="color: #080;">&#41;</span>
	<span style="color: #080;">&#123;</span>
		<span style="color: #0000FF; font-weight: bold;">plot</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">0</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">col</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;white&quot;</span>, xlim <span style="color: #080;">=</span> x.<span style="">range</span>, ylim <span style="color: #080;">=</span> y.<span style="">range</span>,
			axes <span style="color: #080;">=</span> <span style="color: #0000FF; font-weight: bold;">FALSE</span>,
			xlab <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;Number of clusters (k)&quot;</span>, ylab <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;PCA weighted Mean of the clusters&quot;</span>, main <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;Clustergram of the PCA-weighted Mean of the k-means clusters vs number of clusters (k)&quot;</span><span style="color: #080;">&#41;</span>
		<span style="color: #0000FF; font-weight: bold;">axis</span><span style="color: #080;">&#40;</span>side <span style="color: #080;">=</span><span style="color: #ff0000;">1</span>, at <span style="color: #080;">=</span> k.<span style="">range</span><span style="color: #080;">&#41;</span>
		<span style="color: #0000FF; font-weight: bold;">axis</span><span style="color: #080;">&#40;</span>side <span style="color: #080;">=</span><span style="color: #ff0000;">2</span><span style="color: #080;">&#41;</span>
		<span style="color: #0000FF; font-weight: bold;">abline</span><span style="color: #080;">&#40;</span>v <span style="color: #080;">=</span> k.<span style="">range</span>, <span style="color: #0000FF; font-weight: bold;">col</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;grey&quot;</span><span style="color: #080;">&#41;</span>
&nbsp;
		<span style="color: #0000FF; font-weight: bold;">matlines</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">t</span><span style="color: #080;">&#40;</span>X<span style="color: #080;">&#41;</span>, <span style="color: #0000FF; font-weight: bold;">t</span><span style="color: #080;">&#40;</span>Y<span style="color: #080;">&#41;</span>, pch <span style="color: #080;">=</span> <span style="color: #ff0000;">19</span>, <span style="color: #0000FF; font-weight: bold;">col</span> <span style="color: #080;">=</span> COL, lty <span style="color: #080;">=</span> <span style="color: #ff0000;">1</span>, lwd <span style="color: #080;">=</span> <span style="color: #ff0000;">1.5</span><span style="color: #080;">&#41;</span>
&nbsp;
		<span style="color: #0000FF; font-weight: bold;">if</span><span style="color: #080;">&#40;</span>add.<span style="">center</span>.<span style="">points</span><span style="color: #080;">&#41;</span>
		<span style="color: #080;">&#123;</span>
			<span style="color: #0000FF; font-weight: bold;">require</span><span style="color: #080;">&#40;</span>plyr<span style="color: #080;">&#41;</span>
&nbsp;
			xx <span style="color: #080;">&lt;-</span> ldply<span style="color: #080;">&#40;</span>centers.<span style="">points</span>, <span style="color: #0000FF; font-weight: bold;">rbind</span><span style="color: #080;">&#41;</span>
			<span style="color: #0000FF; font-weight: bold;">points</span><span style="color: #080;">&#40;</span>xx$y~xx$x, pch <span style="color: #080;">=</span> <span style="color: #ff0000;">19</span>, <span style="color: #0000FF; font-weight: bold;">col</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;red&quot;</span>, cex <span style="color: #080;">=</span> <span style="color: #ff0000;">1.3</span><span style="color: #080;">&#41;</span>
&nbsp;
			<span style="color: #228B22;"># add points	</span>
			<span style="color: #228B22;"># temp &lt;- l_ply(centers.points, function(xx) {</span>
									<span style="color: #228B22;"># with(xx,points(y~x, pch = 19, col = &quot;red&quot;, cex = 1.3))</span>
									<span style="color: #228B22;"># points(xx$y~xx$x, pch = 19, col = &quot;red&quot;, cex = 1.3)</span>
									<span style="color: #228B22;"># return(1)</span>
									<span style="color: #228B22;"># })</span>
						<span style="color: #228B22;"># We assign the lapply to a variable (temp) only to suppress the lapply &quot;NULL&quot; output</span>
		<span style="color: #080;">&#125;</span>	
	<span style="color: #080;">&#125;</span>
&nbsp;
&nbsp;
&nbsp;
clustergram <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">function</span><span style="color: #080;">&#40;</span>Data, k.<span style="">range</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">2</span><span style="color: #080;">:</span><span style="color: #ff0000;">10</span> , 
							clustering.<span style="">function</span> <span style="color: #080;">=</span> clustergram.<span style="">kmeans</span>,
							clustergram.<span style="">plot</span> <span style="color: #080;">=</span> clustergram.<span style="">plot</span>.<span style="">matlines</span>, 
							line.<span style="">width</span> <span style="color: #080;">=</span> .004, add.<span style="">center</span>.<span style="">points</span> <span style="color: #080;">=</span> <span style="color: #0000FF; font-weight: bold;">TRUE</span><span style="color: #080;">&#41;</span>
<span style="color: #080;">&#123;</span>
	<span style="color: #228B22;"># Data - should be a scaled matrix, where each column belongs to a different dimension of the observations</span>
	<span style="color: #228B22;"># k.range - is a vector with the number of clusters to plot the clustergram for</span>
	<span style="color: #228B22;"># clustering.function - currently only clustergram.kmeans is provided, but this offers a basis to later extend the function to other algorithms </span>
	<span style="color: #228B22;">#			Although that would require more work on the code</span>
	<span style="color: #228B22;"># line.width - is the amount to lift each line in the plot so they won't superimpose each other</span>
	<span style="color: #228B22;"># add.center.points - whether to plot points for the cluster means</span>
&nbsp;
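	Data <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">as.<span style="">matrix</span></span><span style="color: #080;">&#40;</span>Data<span style="color: #080;">&#41;</span>	<span style="color: #228B22;"># a data.frame must be converted for the matrix product below</span>
	n <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">dim</span><span style="color: #080;">&#40;</span>Data<span style="color: #080;">&#41;</span><span style="color: #080;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #080;">&#93;</span>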
&nbsp;
	PCA.1 <span style="color: #080;">&lt;-</span> Data <span style="color: #080;">%*%</span> <span style="color: #0000FF; font-weight: bold;">princomp</span><span style="color: #080;">&#40;</span>Data<span style="color: #080;">&#41;</span>$loadings<span style="color: #080;">&#91;</span>,<span style="color: #ff0000;">1</span><span style="color: #080;">&#93;</span>	<span style="color: #228B22;"># first principal component of our data</span>
&nbsp;
	<span style="color: #0000FF; font-weight: bold;">if</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">require</span><span style="color: #080;">&#40;</span>colorspace<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span> <span style="color: #080;">&#123;</span>
			COL <span style="color: #080;">&lt;-</span> heat_hcl<span style="color: #080;">&#40;</span>n<span style="color: #080;">&#41;</span><span style="color: #080;">&#91;</span><span style="color: #0000FF; font-weight: bold;">order</span><span style="color: #080;">&#40;</span>PCA.1<span style="color: #080;">&#41;</span><span style="color: #080;">&#93;</span>	<span style="color: #228B22;"># line colors</span>
		<span style="color: #080;">&#125;</span> <span style="color: #0000FF; font-weight: bold;">else</span> <span style="color: #080;">&#123;</span>
			COL <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">rainbow</span><span style="color: #080;">&#40;</span>n<span style="color: #080;">&#41;</span><span style="color: #080;">&#91;</span><span style="color: #0000FF; font-weight: bold;">order</span><span style="color: #080;">&#40;</span>PCA.1<span style="color: #080;">&#41;</span><span style="color: #080;">&#93;</span>	<span style="color: #228B22;"># line colors</span>
			<span style="color: #0000FF; font-weight: bold;">warning</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'Please consider installing the package &quot;colorspace&quot; for prettier colors'</span><span style="color: #080;">&#41;</span>
		<span style="color: #080;">&#125;</span>
&nbsp;
	line.<span style="">width</span> <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">rep</span><span style="color: #080;">&#40;</span>line.<span style="">width</span>, n<span style="color: #080;">&#41;</span>
&nbsp;
	Y <span style="color: #080;">&lt;-</span> NULL	<span style="color: #228B22;"># Y matrix</span>
	X <span style="color: #080;">&lt;-</span> NULL	<span style="color: #228B22;"># X matrix</span>
&nbsp;
	centers.<span style="">points</span> <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">list</span><span style="color: #080;">&#40;</span><span style="color: #080;">&#41;</span>
&nbsp;
	<span style="color: #0000FF; font-weight: bold;">for</span><span style="color: #080;">&#40;</span>k <span style="color: #0000FF; font-weight: bold;">in</span> k.<span style="">range</span><span style="color: #080;">&#41;</span>
	<span style="color: #080;">&#123;</span>
		k.<span style="">clusters</span> <span style="color: #080;">&lt;-</span> clustering.<span style="">function</span><span style="color: #080;">&#40;</span>Data, k<span style="color: #080;">&#41;</span>
&nbsp;
		clusters.<span style="">vec</span> <span style="color: #080;">&lt;-</span> k.<span style="">clusters</span>$cluster
			<span style="color: #228B22;"># the.centers &lt;- apply(cl$centers,1, mean)</span>
		the.<span style="">centers</span> <span style="color: #080;">&lt;-</span> k.<span style="">clusters</span>$centers 
&nbsp;
		noise <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">unlist</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">tapply</span><span style="color: #080;">&#40;</span>line.<span style="">width</span>, clusters.<span style="">vec</span>, <span style="color: #0000FF; font-weight: bold;">cumsum</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#91;</span><span style="color: #0000FF; font-weight: bold;">order</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">seq_along</span><span style="color: #080;">&#40;</span>clusters.<span style="">vec</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#91;</span><span style="color: #0000FF; font-weight: bold;">order</span><span style="color: #080;">&#40;</span>clusters.<span style="">vec</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#93;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#93;</span>	
		<span style="color: #228B22;"># noise &lt;- noise - mean(range(noise))</span>
		y <span style="color: #080;">&lt;-</span> the.<span style="">centers</span><span style="color: #080;">&#91;</span>clusters.<span style="">vec</span><span style="color: #080;">&#93;</span> <span style="color: #080;">+</span> noise
		Y <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">cbind</span><span style="color: #080;">&#40;</span>Y, y<span style="color: #080;">&#41;</span>
		x <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">rep</span><span style="color: #080;">&#40;</span>k, <span style="color: #0000FF; font-weight: bold;">length</span><span style="color: #080;">&#40;</span>y<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
		X <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">cbind</span><span style="color: #080;">&#40;</span>X, x<span style="color: #080;">&#41;</span>
&nbsp;
		centers.<span style="">points</span><span style="color: #080;">&#91;</span><span style="color: #080;">&#91;</span>k<span style="color: #080;">&#93;</span><span style="color: #080;">&#93;</span> <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">data.<span style="">frame</span></span><span style="color: #080;">&#40;</span>y <span style="color: #080;">=</span> the.<span style="">centers</span> , x <span style="color: #080;">=</span> <span style="color: #0000FF; font-weight: bold;">rep</span><span style="color: #080;">&#40;</span>k , k<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>	
	<span style="color: #228B22;">#	points(the.centers ~ rep(k , k), pch = 19, col = &quot;red&quot;, cex = 1.5)</span>
	<span style="color: #080;">&#125;</span>
&nbsp;
&nbsp;
	x.<span style="">range</span> <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">range</span><span style="color: #080;">&#40;</span>k.<span style="">range</span><span style="color: #080;">&#41;</span>
	y.<span style="">range</span> <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">range</span><span style="color: #080;">&#40;</span>PCA.1<span style="color: #080;">&#41;</span>
&nbsp;
	clustergram.<span style="">plot</span><span style="color: #080;">&#40;</span>X,Y, k.<span style="">range</span>, 
											x.<span style="">range</span>, y.<span style="">range</span> , COL, 
											add.<span style="">center</span>.<span style="">points</span> , centers.<span style="">points</span><span style="color: #080;">&#41;</span>
&nbsp;
&nbsp;
<span style="color: #080;">&#125;</span></pre></div></div>
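
<p>Since clustergram() accepts any clustering.function that returns a list with cluster and centers, other algorithms can in principle be plugged in.  The following is only a sketch of how that might look for hierarchical clustering: clustergram.hclust is a hypothetical name (it is not part of the code above), and using hclust on Euclidean distances with its default (complete) linkage is an assumption made for this example.</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;">clustergram.hclust &lt;- function(Data, k, ...)
{
	# hypothetical helper, following the structure of clustergram.kmeans
	Data &lt;- as.matrix(Data)
	tree &lt;- hclust(dist(Data), ...)	# assumption: Euclidean distances, default (complete) linkage
	cluster &lt;- cutree(tree, k = k)	# cluster membership of each observation

	# mean of every column within each cluster (one row per cluster)
	centers.matrix &lt;- apply(Data, 2, function(column) tapply(column, cluster, mean))
	centers &lt;- centers.matrix %*% princomp(Data)$loadings[,1]	# 1 number per center

	return(list(cluster = cluster, centers = centers))
}</pre></div></div>

<p>With the functions above already defined, it could then be tried with something like clustergram(Data, k.range = 2:8, clustering.function = clustergram.hclust).</p>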

<h3>Example on the iris dataset</h3>
<p>The <a href="http://en.wikipedia.org/wiki/Iris_flower_data_set">iris data set</a> is a favorite example of many <a href="http://www.r-bloggers.com/?s=iris">R bloggers</a> when writing about <a href="http://opendatagroup.com/2009/10/21/r-accessors-explained/">R accessors</a>, <a href="http://learnr.wordpress.com/2009/10/06/export-data-frames-to-multi-worksheet-excel-file/">data exporting</a>, <a href="http://yihui.name/en/2009/09/how-to-import-ms-excel-data-into-r/">data importing</a>, and for <a href="http://weitaiyun.blogspot.com/2009/03/unison-graph-and-parallel-coordinate.html">different</a> <a href="http://weitaiyun.blogspot.com/2009/03/scatterplots.html">visualization</a> techniques.<br />
So it seemed only natural to experiment on it here.</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;"><span style="color: #0000FF; font-weight: bold;">data</span><span style="color: #080;">&#40;</span><span style="color: #CC9900; font-weight: bold;">iris</span><span style="color: #080;">&#41;</span>
<span style="color: #0000FF; font-weight: bold;">set.<span style="">seed</span></span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">250</span><span style="color: #080;">&#41;</span>
<span style="color: #0000FF; font-weight: bold;">par</span><span style="color: #080;">&#40;</span>cex.<span style="">lab</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">1.5</span>, cex.<span style="">main</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">1.2</span><span style="color: #080;">&#41;</span>
Data <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">scale</span><span style="color: #080;">&#40;</span><span style="color: #CC9900; font-weight: bold;">iris</span><span style="color: #080;">&#91;</span>,<span style="color: #080;">-</span><span style="color: #ff0000;">5</span><span style="color: #080;">&#93;</span><span style="color: #080;">&#41;</span> <span style="color: #228B22;"># notice I am scaling the vectors</span>
clustergram<span style="color: #080;">&#40;</span>Data, k.<span style="">range</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">2</span><span style="color: #080;">:</span><span style="color: #ff0000;">8</span>, line.<span style="">width</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.004</span><span style="color: #080;">&#41;</span> <span style="color: #228B22;"># notice how I am using line.width.  Play with it on your problem, according to the scale of Y.</span></pre></div></div>

<p>Here is the output:<br />
<a href="http://www.r-statistics.com/wp-content/uploads/2010/06/clustergram-1.png"><img src="http://www.r-statistics.com/wp-content/uploads/2010/06/clustergram-1.png" alt="" title="clustergram 1" width="500"></a></p>
<p>Looking at the image we can notice a few interesting things.  We notice that one of the clusters formed (the lower one) stays as is no matter how many clusters we are allowing (except for one observation that goes away and then back).<br />
We can also see that the second split is a solid one (in the sense that it splits the first cluster into two clusters which are not &#8220;close&#8221; to each other, and that about half the observations go to each of the new clusters).<br />
And then notice how moving to 5 clusters makes almost no difference.<br />
Lastly, notice how when going for 8 clusters, we are practically left with 4 clusters (remember &#8211; this is according to the mean of the cluster centers, weighted by the loadings of the first component of the PCA on the data).</p>
<p>If I were to take something from this graph, I would say I have a strong tendency to use 3-4 clusters on this data.</p>
<p>But wait, did our clustering algorithm do a stable job?<br />
Let&#8217;s try running the algorithm 6 more times (each run will have a different starting point for the clusters)</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;"><span style="color: #0000FF; font-weight: bold;">set.<span style="">seed</span></span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">500</span><span style="color: #080;">&#41;</span>
Data <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">scale</span><span style="color: #080;">&#40;</span><span style="color: #CC9900; font-weight: bold;">iris</span><span style="color: #080;">&#91;</span>,<span style="color: #080;">-</span><span style="color: #ff0000;">5</span><span style="color: #080;">&#93;</span><span style="color: #080;">&#41;</span> <span style="color: #228B22;"># notice I am scaling the vectors</span>
<span style="color: #0000FF; font-weight: bold;">par</span><span style="color: #080;">&#40;</span>cex.<span style="">lab</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">1.2</span>, cex.<span style="">main</span> <span style="color: #080;">=</span> .7<span style="color: #080;">&#41;</span>
<span style="color: #0000FF; font-weight: bold;">par</span><span style="color: #080;">&#40;</span>mfrow <span style="color: #080;">=</span> <span style="color: #0000FF; font-weight: bold;">c</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">3</span>,<span style="color: #ff0000;">2</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
<span style="color: #0000FF; font-weight: bold;">for</span><span style="color: #080;">&#40;</span>i <span style="color: #0000FF; font-weight: bold;">in</span> <span style="color: #ff0000;">1</span><span style="color: #080;">:</span><span style="color: #ff0000;">6</span><span style="color: #080;">&#41;</span> clustergram<span style="color: #080;">&#40;</span>Data, k.<span style="">range</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">2</span><span style="color: #080;">:</span><span style="color: #ff0000;">8</span> , line.<span style="">width</span> <span style="color: #080;">=</span> .004, add.<span style="">center</span>.<span style="">points</span> <span style="color: #080;">=</span> <span style="color: #0000FF; font-weight: bold;">TRUE</span><span style="color: #080;">&#41;</span></pre></div></div>

<p>The result:  (click the image to enlarge it)<br />
<a href="http://www.r-statistics.com/wp-content/uploads/2010/06/clustergram-6.png"><img src="http://www.r-statistics.com/wp-content/uploads/2010/06/clustergram-6.png" alt="" title="clustergram 6" width="500"></a><br />
Repeating the analysis offers even more insights.<br />
First, it appears that for up to 3 clusters the algorithm gives rather stable results.<br />
From 4 clusters onwards, we get a different outcome on each run.<br />
In some cases, we got 3 clusters when we asked for 4 or even 5.</p>
<p>Having reviewed the new plots, I would go with the 3-cluster option, noting that the two &#8220;upper&#8221; clusters might have similar properties while the lower cluster is quite distinct from the other two.</p>
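<p>The instability from 4 clusters onwards can also be checked numerically.  The following is only a minimal sketch (it is not part of the clustergram function): it assumes that the spread of the total within-cluster sum of squares over several random starts is a reasonable proxy for stability, and the helper name <code>restart_spread</code> is made up for this illustration.</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;">Data &lt;- scale(iris[, -5])

# Hypothetical helper: run kmeans n.runs times with k centers (one random
# start per run) and return the total within-cluster sum of squares of each run.
restart_spread &lt;- function(x, k, n.runs = 20) {
  sapply(seq_len(n.runs), function(i) {
    kmeans(x, centers = k, iter.max = 50)$tot.withinss
  })
}

set.seed(250)
# For each k: the range of the results over the runs
# (0 means that every run reached a solution of the same quality)
spread &lt;- sapply(2:8, function(k) diff(range(restart_spread(Data, k))))
names(spread) &lt;- 2:8
round(spread, 2)</pre></div></div>

<p>Values close to zero would suggest that the runs agree with each other for that number of clusters; larger values would suggest that the outcome depends on the random start.</p>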
<p>By the way, the iris data set is composed of three species of flowers.  I imagine that kmeans did a decent job of distinguishing the three.</p>
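<p>One way to check this impression is to cross-tabulate the cluster labels against the known species.  This is a small sketch using only base R; the choice of <code>nstart = 25</code> is my own assumption, made to reduce the dependence on the random start.</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;">set.seed(250)
Data &lt;- scale(iris[, -5])
fit &lt;- kmeans(Data, centers = 3, nstart = 25)

# Rows: the true species; columns: the cluster labels
# (the numbering of the clusters is arbitrary)
confusion &lt;- table(Species = iris$Species, Cluster = fit$cluster)
confusion</pre></div></div>

<p>A species whose row is concentrated in a single column was recovered well; a row that is split between columns shows where kmeans mixed species together.</p>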
<h3>Limitation of the method (and a possible way to overcome it?!)</h3>
<p>It is worth noting that the way the algorithm is currently built has a fundamental limitation: the plot works well only when each of the clusters is clearly &#8220;bigger&#8221; than the one before it (on the first principal component of the data).</p>
<p>For example, let&#8217;s create a dataset with 3 clusters, each drawn from a normal distribution with a higher mean than the previous one:</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;"><span style="color: #0000FF; font-weight: bold;">set.<span style="">seed</span></span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">250</span><span style="color: #080;">&#41;</span>
Data <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">rbind</span><span style="color: #080;">&#40;</span>
				<span style="color: #0000FF; font-weight: bold;">cbind</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>,
				<span style="color: #0000FF; font-weight: bold;">cbind</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">1</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">1</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">1</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>,
				<span style="color: #0000FF; font-weight: bold;">cbind</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">2</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">2</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">2</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
				<span style="color: #080;">&#41;</span>				
clustergram<span style="color: #080;">&#40;</span>Data, k.<span style="">range</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">2</span><span style="color: #080;">:</span><span style="color: #ff0000;">5</span> , line.<span style="">width</span> <span style="color: #080;">=</span> .004, add.<span style="">center</span>.<span style="">points</span> <span style="color: #080;">=</span> <span style="color: #0000FF; font-weight: bold;">TRUE</span><span style="color: #080;">&#41;</span></pre></div></div>

<p>The resulting plot for this is the following:<br />
<a href="http://www.r-statistics.com/wp-content/uploads/2010/06/Clustergram-3-ordered-clusters.png"><img src="http://www.r-statistics.com/wp-content/uploads/2010/06/Clustergram-3-ordered-clusters.png" alt="" title="Clustergram-3-ordered-clusters" width="500" class="alignnone size-full wp-image-402" /></a><br />
The image shows a clear distinction between three ranks of clusters.  Looking at this image, there is no doubt (for me) that three is the correct number of clusters.</p>
<p>But what if the clusters were different but didn&#8217;t have an ordering to them?<br />
For example, look at the following 4-dimensional data:</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;"><span style="color: #0000FF; font-weight: bold;">set.<span style="">seed</span></span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">250</span><span style="color: #080;">&#41;</span>
Data <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">rbind</span><span style="color: #080;">&#40;</span>
				<span style="color: #0000FF; font-weight: bold;">cbind</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">1</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>,
				<span style="color: #0000FF; font-weight: bold;">cbind</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">1</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>,
				<span style="color: #0000FF; font-weight: bold;">cbind</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">1</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">1</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>,
				<span style="color: #0000FF; font-weight: bold;">cbind</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">0</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span>,<span style="color: #0000FF; font-weight: bold;">rnorm</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">100</span>,<span style="color: #ff0000;">1</span>, <span style="color: #0000FF; font-weight: bold;">sd</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">0.3</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
				<span style="color: #080;">&#41;</span>				
clustergram<span style="color: #080;">&#40;</span>Data, k.<span style="">range</span> <span style="color: #080;">=</span> <span style="color: #ff0000;">2</span><span style="color: #080;">:</span><span style="color: #ff0000;">8</span> , line.<span style="">width</span> <span style="color: #080;">=</span> .004, add.<span style="">center</span>.<span style="">points</span> <span style="color: #080;">=</span> <span style="color: #0000FF; font-weight: bold;">TRUE</span><span style="color: #080;">&#41;</span></pre></div></div>

<p><a href="http://www.r-statistics.com/wp-content/uploads/2010/06/Clustergram-4-UNordered-clusters.png"><img src="http://www.r-statistics.com/wp-content/uploads/2010/06/Clustergram-4-UNordered-clusters.png" alt="" title="Clustergram-4-UNordered-clusters" width="500" class="alignnone size-full wp-image-403" /></a></p>
<p>In this situation, it is not clear from the location of the clusters on the Y axis that we are dealing with 4 clusters.<br />
What is interesting, though, is that as the number of clusters grows, we can notice that there are 4 &#8220;strands&#8221; of data points moving more or less together (until we reach 4 clusters, at which point the clusters start breaking up).<br />
Another hope for handling this might be using the color of the lines in some way, but I haven&#8217;t yet figured out how.</p>
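<p>To see why the Y axis is not very informative here, one can project the 4 cluster centers on the first principal component and check how much of the variance that component carries.  This is only a sketch: it assumes that the Y position of a cluster is essentially its center projected on the first principal component, and it rebuilds the data from the same group means as above (the name <code>group.means</code> is introduced just for this illustration).</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;">set.seed(250)
# Group means copied from the example above (one row per group)
group.means &lt;- rbind(c(1, 0, 0, 0),
                     c(0, 1, 0, 0),
                     c(0, 1, 1, 0),
                     c(0, 0, 0, 1))
# 100 observations per group, 4 columns, sd = 0.3 around each mean
Data &lt;- do.call(rbind, lapply(1:4, function(g) {
  sapply(group.means[g, ], function(m) rnorm(100, mean = m, sd = 0.3))
}))

pca &lt;- prcomp(Data)    # prcomp centers the columns by default
fit &lt;- kmeans(Data, centers = 4, nstart = 25)

# Project each cluster center on PC1 (after removing the column means)
centered &lt;- sweep(fit$centers, 2, pca$center)
pc1.position &lt;- drop(centered %*% pca$rotation[, 1])
sort(pc1.position)

# Share of the total variance carried by PC1
pc1.share &lt;- pca$sdev[1]^2 / sum(pca$sdev^2)
pc1.share</pre></div></div>

<p>If several of the projected centers land close to each other, or if the first component carries only a modest share of the variance, then the vertical positions alone cannot be expected to separate the 4 groups.</p>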
<h3>Clustergram with ggplot2</h3>
<p><a href="http://had.co.nz/">Hadley Wickham</a> has kindly played with recreating the clustergram using the ggplot2 engine.  You can see the result here:<br />
<a href="http://gist.github.com/439761">http://gist.github.com/439761</a><br />
And this is what he wrote about it in the comments:</p>
<blockquote><p>I’ve broken it down into three components:<br />
* run the clustering algorithm and get predictions (many_kmeans and all_hclust)<br />
* produce the data for the clustergram (clustergram)<br />
* plot it (plot.clustergram)<br />
I don’t think I have the logic behind the y-position adjustment quite right though.</p></blockquote>
<p>Here is an example of how it looks:<br />
<a href="http://www.r-statistics.com/wp-content/uploads/2010/06/clustergram-ggplot2-1.png"><img src="http://www.r-statistics.com/wp-content/uploads/2010/06/clustergram-ggplot2-1.png" alt="" title="clustergram-ggplot2-1" width="500" class="alignnone size-full wp-image-407" /></a></p>
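<p>The first of the three components described above (running the clustering algorithm and getting the predictions) can be sketched in base R as follows.  This is <strong>not</strong> the code from the gist: the function name <code>kmeans_assignments</code> and its arguments are hypothetical, and it only handles kmeans.</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;"># Hypothetical helper (not the code from the gist): cluster x once for each
# k in k.range and return a data frame with one column of labels per k.
kmeans_assignments &lt;- function(x, k.range = 2:5, ...) {
  labels &lt;- lapply(k.range, function(k) kmeans(x, centers = k, ...)$cluster)
  names(labels) &lt;- paste0("k", k.range)
  as.data.frame(labels)
}

set.seed(250)
assignments &lt;- kmeans_assignments(scale(iris[, -5]), k.range = 2:5, nstart = 10)
head(assignments)</pre></div></div>

<p>Each row is one observation and each column holds its cluster label for one value of k, which is the kind of table that the later steps (producing the data for the clustergram and plotting it) would start from.</p>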
<h3>Conclusions (some rules of thumb and questions for the future)</h3>
<p>At first glance, it appears that the clustergram can be useful.  I can imagine using this graph to quickly run various clustering algorithms and then compare them to each other and review their stability (in the way I demonstrated in the example above).</p>
<p>The three rules of thumb I have noticed so far are:</p>
<ol>
<li>Look at the location of the cluster points on the Y axis. See when they remain stable, when they start flying around, and what happens to them at higher numbers of clusters (do they regroup?)</li>
<li>Observe the strands of the data points.  Even if the cluster centers are not ordered, the lines for each item might (this needs more research and thinking) tend to move together &#8211; hinting at the real number of clusters</li>
<li>Run the plot multiple times to observe the stability of the cluster formation (and location)</li>
</ol>
<p>Yet there is more work to be done, and there are questions still to be answered:</p>
<ul>
<li>The code needs to be extended to offer methods for various clustering algorithms.
</li>
<li>How can the colors of the lines be used better?
</li>
<li>How can this be done using other graphical engines (ggplot2/lattice?) &#8211; (<strong>Update</strong>: look at Hadley&#8217;s reply in the comments)
</li>
<li>What should be done when the first principal component doesn&#8217;t capture enough of the variation in the data? (Maybe plot this graph for each of the relevant components, but then how do you draw conclusions from it?)
</li>
<li>What other uses can be found for this graph, and what other conclusions can be drawn from it?
</li>
</ul>
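<p>As a starting point for the question about the first principal component, here is a small sketch of how to check what share of the variance each component captures.  It uses the scaled iris data as an example, and the variable names are my own.</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;">Data &lt;- scale(iris[, -5])
pca &lt;- prcomp(Data)

# Proportion of the total variance explained by each component
var.explained &lt;- pca$sdev^2 / sum(pca$sdev^2)

# Cumulative share: the first value belongs to PC1 alone
round(cumsum(var.explained), 3)</pre></div></div>

<p>If the first value is low, the vertical positions in the clustergram summarize only a small part of the data and should be read with extra care.</p>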
<p>I am looking forward to reading your input/ideas in the comments (or in reply posts).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>June 20, online Registration deadline for useR! 2010</title>
		<link>http://www.r-statistics.com/2010/06/june-20-online-registration-deadline-for-user-2010/</link>
		<comments>http://www.r-statistics.com/2010/06/june-20-online-registration-deadline-for-user-2010/#comments</comments>
		<pubDate>Mon, 14 Jun 2010 05:58:49 +0000</pubDate>
		<dc:creator>Tal Galili</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[R community]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[dates]]></category>
		<category><![CDATA[deadline]]></category>
		<category><![CDATA[useR]]></category>
		<category><![CDATA[useR 2010]]></category>
		<category><![CDATA[useR conference]]></category>
		<category><![CDATA[useR2010]]></category>

		<guid isPermaLink="false">http://www.r-statistics.com/?p=387</guid>
		<description><![CDATA[useR!2010 is coming. I am going to give two talks there (I will write more of that soon), but in the meantime, please note that the online registration deadline is coming to an end. This was published on the R-help mailing list today: &#8212;&#8212;&#8212;&#8212;- The final registration deadline for the R User Conference is June 20, 2010, one week away.  Later registration will not be possible on site! Conference webpage:  http://www.R-project.org/useR-2010 Conference program: http://www.R-project.org/useR-2010/program.html Registration: http://www.R-project.org/useR-2010/registration/registration.html The conference is scheduled for [...]]]></description>
			<content:encoded><![CDATA[<p>useR!2010 is coming. I am going to give two talks there (I will write more about that soon), but in the meantime, please note that the online registration deadline is approaching.</p>
<p>This was published on the R-help mailing list today:</p>
<p>&#8212;&#8212;&#8212;&#8212;-</p>
<p>The final registration deadline for the R User Conference is June 20,<br />
2010, one week away.  Later registration will not be possible on site!</p>
<p>Conference webpage:  <a href="http://www.r-project.org/useR-2010" target="_blank">http://www.R-project.org/useR-2010</a><br />
Conference program: <a href="http://www.r-project.org/useR-2010/program.html" target="_blank">http://www.R-project.org/useR-2010/program.html</a></p>
<p>Registration:<br />
<a href="http://www.r-project.org/useR-2010/registration/registration.html" target="_blank">http://www.R-project.org/useR-2010/registration/registration.html</a></p>
<p>The conference is scheduled for July 21-23, 2010, and will take place at<br />
the campus of the National Institute of Standards and Technology (NIST) in<br />
Gaithersburg, Maryland, USA.</p>
<p><span id="more-387"></span></p>
<p>Following the successful useR! 2004, useR! 2006, useR! 2007, useR! 2008,<br />
and useR! 2009, conferences, the conference is focused on:</p>
<p>1. R as the `lingua franca&#8217; of data analysis and statistical computing,<br />
2. providing a platform for R users to discuss and exchange ideas on<br />
how R can be used to do statistical computations, data analysis,<br />
visualization and exciting applications in various fields,<br />
3. giving an overview of the new features of the rapidly evolving R<br />
project.</p>
<p>As for the predecessor conferences, the program will consist of two parts:<br />
invited lectures and user-contributed sessions.  Prior to the conference,<br />
there will be tutorials on R, descriptions of which are available at<br />
<a href="http://www.r-project.org/useR-2010/tutorials" target="_blank">http://www.R-project.org/useR-2010/tutorials</a></p>
<p>INVITED LECTURES</p>
<p>Invited speakers will include</p>
<p>Mark Handcock, Frank Harrell Jr, Friedrich Leisch, Michael Meyer,<br />
Richard Stallman, Luke Tierney, Diethelm Wuertz.</p>
<p>USER-CONTRIBUTED SESSIONS</p>
<p>The sessions will be a platform to bring together R users, contributors,<br />
package maintainers and developers in the S spirit that `users are<br />
developers&#8217;. People from different fields will show us how they solve<br />
problems with R in fascinating applications.  The sessions are organized<br />
by members of the program committee, including</p>
<p>Dirk Eddelbuettel, John Fox, Virgilio Gomez-Rubio,<br />
Richard Heiberger, Torsten Hothorn, Aaron King, Jan de Leeuw,<br />
Nicholas Lewin-Koh, Andy Liaw, Uwe Ligges, Martin Maechler,<br />
Katharine Mullen, Heather Turner, Ravi Varadhan, H. D. Vinod,<br />
John Verzani, Alan Zaslavsky, Achim Zeileis.</p>
<p>The program will cover topics such as</p>
<p>* Applied Statistics &amp; Biostatistics<br />
* Bayesian Statistics<br />
* Bioinformatics<br />
* Chemometrics and Computational Physics<br />
* Data Mining<br />
* Econometrics &amp; Finance<br />
* Environmetrics &amp; Ecological Modeling<br />
* High Performance Computing<br />
* Machine Learning<br />
* Marketing &amp; Business Analytics<br />
* Psychometrics<br />
* Robust Statistics<br />
* Social network analysis<br />
* Spatial Statistics<br />
* Statistics in the Social and Political Sciences<br />
* Teaching<br />
* Visualization &amp; Graphics<br />
* and many more.</p>
<p>IMPORTANT DATES</p>
<p>*********************************************************************<br />
**   2010-06-20 registration deadline<br />
**                (later registration NOT possible on site)<br />
*********************************************************************<br />
2010-07-20   tutorials<br />
2010-07-21   conference start<br />
2010-07-23   conference end</p>
<p>We hope to meet you in Gaithersburg!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.r-statistics.com/2010/06/june-20-online-registration-deadline-for-user-2010/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
