One year in advance of the 2012 election, New York Times blogger Nate Silver published a presidential forecasting model. The model includes measures of presidential approval and economic performance -- standard variables in election forecasting models -- as well as a novel measure of challenger ideology that appears to have substantial effects. Based on this model, Silver estimates that "The difference between [Mitt] Romney and [Rick] Perry amounts to about 4 percentage points" -- a huge predicted effect that could easily swing the outcome of the election. Consider, for instance, Seth Masket's graphic illustrating how the predicted probability of a Republican win depends heavily on the estimated ideology of the GOP candidate:
Though candidate positioning is likely to influence presidential election outcomes, there are important reasons to question whether the challenger ideology effect Silver identifies is so powerful.
First, when the economy is growing and presidential approval is high, strong moderate candidates may be scared off from entering the race, leaving only ideologues. A similar effect has been shown when one party has held the presidency for a long period of time. When this happens, the opposition tends to perform better due to the perception that it is "time for a change", and opposition parties are likely to nominate more moderate candidates in the hopes of regaining control of the White House at the expense of ideological purity.
Second, the estimates of challenger ideology that Silver uses are primarily drawn from voter perceptions of the candidates. However, these perceptions are driven by the content of the campaign, which is itself shaped by the economic context. Candidates who appear extreme in one era may seem less so in the next (consider the changing perceptions of Ronald Reagan between 1976 and 1980, for instance). For all of these reasons, Silver's estimates of the effects of challenger ideology on election outcomes are likely to be significantly exaggerated.
In addition, as we demonstrate below, Silver's model does not substantially improve the accuracy of presidential election forecasts, which casts further doubt on the importance of candidate ideology (see also Alan Abramowitz).
Silver's model includes three predictor variables -- presidential approval one year in advance of the election, election-year GDP growth, and an estimate of challenger extremism (i.e., the extremism of the candidate of the party that doesn't control the presidency at the time of the election). Using just three variables to predict the outcome of a presidential election may seem simplistic, but in forecasting, simplicity is a virtue. With only 17 elections since 1944 to work with, including many indicators in a statistical model is likely to result in the identification of factors that are highly correlated with the election results we have already observed but that do a horrible job of predicting the future.
For related reasons, Silver criticizes other forecasting models that use relatively obscure economic variables such as growth in real per-capita disposable income:
The government tracks literally 39,000 economic indicators each year.... When you have this much data to sort through but only 17 elections since 1944 to test them upon, some indicators will perform superficially better based on chance alone.... The advantage of looking at G.D.P. is that it represents the broadest overall evaluation of economic activity in the United States.
As Silver notes, there are legitimate reasons to worry that the search for statistically significant predictors will result in identifying indicators that perform well by "chance alone" (an extreme example: Washington Redskins home wins). Using such indicators can cause us to be overconfident in our statistical models (what statisticians call overfitting the data) and tends to make accurately predicting future events -- like next year's election -- very difficult or impossible.
As you might expect, scholars have spilled a lot of ink debating the best forecasting indicators for outcomes ranging from the paths of hurricanes to stock prices. But rather than have a philosophical debate, we can evaluate this concern empirically by determining the extent to which specific forecasting models can successfully predict election outcomes beyond the range of the data used to estimate them. In particular, if models are spuriously identifying chance relationships, then they should perform relatively poorly after the point at which they were first published.
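To make the "chance alone" concern concrete, here is a minimal simulation sketch (ours, purely for illustration, not anything from Silver's analysis): with only 17 outcomes and thousands of pure-noise indicators, at least one indicator will almost always look like an excellent in-sample predictor even though it has no predictive value at all. All numbers and names below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 17, 5000                        # 17 elections, thousands of candidate indicators
y = rng.normal(50, 5, n)               # hypothetical incumbent-party vote shares
X = rng.normal(size=(n, k))            # indicators that are pure noise by construction

# In-sample correlation of each noise indicator with the outcome
r = (X - X.mean(0)).T @ (y - y.mean()) / (n * X.std(0) * y.std())
print(f"Strongest chance correlation: r = {np.abs(r).max():.2f}")
# With 5,000 tries and 17 observations, the best |r| is often above 0.7 --
# impressive in sample, worthless for forecasting the next election.
```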
To do so, we began with Silver's source data, which was compiled from the New York Times website and generously shared with us by Harry Enten.* Using a standard linear regression model, we almost precisely replicated the coefficients in the JavaScript code for the interactive calculator on the Times website.
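For the curious, the replication step amounts to an ordinary least squares regression of incumbent-party vote share on the three predictors. The sketch below shows the general idea in Python; the file and column names are hypothetical (our own analysis was done in Stata with Silver's source data).

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names for illustration only
df = pd.read_csv("silver_forecast_data.csv")
X = sm.add_constant(df[["approval", "gdp_growth", "challenger_extremism"]])
ols = sm.OLS(df["incumbent_two_party_vote"], X).fit()

print(ols.params)  # compare against the coefficients in the Times' interactive calculator
```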
As a starting point for evaluating Silver's model, we first compare it with Douglas Hibbs's "Bread and Peace" model, which uses the real per-capita disposable income variable described above. Silver has previously criticized the Hibbs model as performing relatively poorly outside the range of years in which the model was first estimated (1952-1988). Here is the key graphic in question:
However, when we estimate both models using the same data as Silver's graph above (1952-1988) and predict the outcome for the 1992-2008 period in terms of the share of the two-party vote received by the party of the president (the standard outcome variable in the literature), we find that Hibbs's model generally performs better than Silver's (Stata data and do files available upon request):
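For readers who want to see the mechanics of this kind of out-of-sample test, here is a rough outline in Python: fit each model on the 1952-1988 elections, predict 1992-2008, and compare the errors. The file and variable names are hypothetical, and the Hibbs specification shown is deliberately simplified (his actual model uses a weighted average of per-capita real disposable income growth over the term plus a military fatalities term).

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("forecast_data.csv")  # hypothetical file
train = df[(df.year >= 1952) & (df.year <= 1988)]
test = df[(df.year >= 1992) & (df.year <= 2008)]

silver = smf.ols("inc_vote ~ approval + gdp_growth + extremism", data=train).fit()
hibbs = smf.ols("inc_vote ~ rdi_growth", data=train).fit()  # simplified "Bread and Peace"

for name, model in [("Silver", silver), ("Hibbs", hibbs)]:
    err = (model.predict(test) - test.inc_vote).abs().mean()
    print(f"{name}: mean absolute out-of-sample error = {err:.1f} points")
```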
Of course, these are not the only available models for comparison. Indeed, political scientists and economists have estimated dozens of other presidential forecasting models over the past twenty years. For example, PS: Political Science and Politics published a pre-election symposium in 2008 that included presidential election forecasts from numerous scholars (see also here, here, here, or here). Most such models make predictions based on economic conditions and/or public opinion, but they typically do not include measures of candidate ideology.
While it is fun to compare the performance of these forecasts, we should be clear that there is no one "correct" model. Rather than relying on a single model, we can instead combine the forecasts of numerous models using a technique called ensemble Bayesian model averaging, which creates a composite forecast weighted by the predictive performance of the component models. This approach was developed for combining weather forecasting models (see here) and has been applied to political outcomes in a paper (PDF) by Montgomery, Ward, and Hollenbach.
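In rough terms, the ensemble assigns each model a weight based on how well it has predicted past elections and then combines the forecasts. The sketch below is a highly simplified version of that idea -- a normal mixture with weights fit by EM and a fixed forecast variance. MWH's actual EBMA procedure is more involved (it calibrates the component forecasts, estimates the variance, and uses genuine out-of-sample forecasts), so treat this only as an illustration of the weighting logic.

```python
import numpy as np

def ebma_weights(past_forecasts, past_outcomes, sigma=2.0, iters=200):
    """Simplified EBMA-style weights.

    past_forecasts: array of shape (n_elections, n_models)
    past_outcomes:  array of shape (n_elections,)
    Returns mixture weights summing to 1.
    """
    n, m = past_forecasts.shape
    w = np.full(m, 1.0 / m)
    for _ in range(iters):
        # E-step: each model's "responsibility" for each past election,
        # treating the model as a normal density centered on its forecast
        dens = np.exp(-0.5 * ((past_outcomes[:, None] - past_forecasts) / sigma) ** 2)
        resp = w * dens + 1e-12
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weights are the average responsibilities
        w = resp.mean(axis=0)
    return w

def ensemble_point_forecast(new_forecasts, w):
    # Weighted average of the component models' point forecasts
    return float(np.dot(w, new_forecasts))
```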
The figure below, which uses the methodology described by MWH to create Figure 3 in their paper, compares one-step-ahead predictions from Silver's model, the most recent versions of six prominent models in the literature (Campbell's "Trial-Heat and Economy Model," Abramowitz's "Time-for-Change Model," Hibbs's "Bread and Peace Model," Fair's presidential vote-share model, Lewis-Beck and Tien's "Jobs Model Forecast," and Erikson and Wlezien's "Leading Economic Indicators and Poll" forecast**), and a composite forecast created using the ensemble technique. The forecast of each model is plotted with its 67% and 90% confidence intervals against the eventual outcome, which is represented with a dotted line:
Silver's model performs well in some elections, but it is much less accurate than its peers in 1992 and 2008. With those exceptions, it does not appear to differ dramatically from the other models, though its overall performance is worse on average than theirs. The ensemble forecast appears to perform quite well, producing predictions that are relatively close to the actual outcomes.
The figure below, which is adapted from Table 4 in MWH, compares the accuracy of the models more precisely using mean absolute error -- an intuitive (though imperfect) metric for comparing forecasting models that measures the average amount by which they mispredict the final outcome.
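For concreteness, the metric is just the average of the absolute forecast errors; the numbers below are invented purely to show the calculation.

```python
import numpy as np

# Hypothetical forecast errors (forecast minus actual two-party vote share, in points)
errors = np.array([1.2, -2.5, 0.8, 3.1, -0.4])
mae = np.abs(errors).mean()
print(f"Mean absolute error: {mae:.1f} percentage points")  # 1.6 in this toy example
```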
We can see that all of the models are relatively accurate on average. They mispredict the vote share for the incumbent party by an average of 1.7 to 3.4 percentage points -- an impressive record given that most models include only two or three variables. By this metric, Silver's forecasts are the least accurate in the group and the ensemble forecast is the most accurate. (See MWH for a discussion of the extent to which these models appropriately express uncertainty about their predictions.) Since some of these models -- and implicitly the ensemble model that relies on them -- have been around for twenty years, this result should not be especially surprising. The literature on presidential forecasting is relatively mature.
At this point, we should note two important but wonky caveats. First, we follow MWH in using the most recent version of each of the forecasting models from political science. In some cases, model specifications may have been revised to account for previous results, which could artificially improve their performance in one-step-ahead prediction tasks (see footnote 14 in MWH). Second, these models differ in the extent to which we would even expect them to make accurate forecasts far in advance of a presidential election. For instance, Campbell's model includes the president's trial heat performance on the Labor Day before the election. Silver's model, on the other hand, takes on the more ambitious challenge of using approval data from a year before the election (though it relies on GDP growth during the election year, which is of course not available in advance).
Ultimately, almost every analyst agrees at this point that it is still too soon to say with much confidence whether President Obama will win in November. In particular, there is still too much uncertainty about the state of the economy next year. However, both theory and data suggest that the conservatism of his opponent is likely to matter less than Silver's model suggests.
Correction 11/10 1:19 PM: A previous version of this post stated that James Campbell's "Trial Heat and Economy" model uses presidential approval on the Labor Day before the election as a predictor. It actually uses the president's performance in a trial heat poll against his opponent on Labor Day.
Update 11/11 10:57 AM: We would like to clarify that the model comparison we performed did not directly test whether adding estimates of challenger ideology to existing forecasting models would improve their performance. It would be desirable to do so. In the time that we had, our goals were to (a) raise concerns about causal inference and the difficulty of measuring challenger ideology and (b) compare the performance of Silver's model against others in the forecasting literature.
Update 11/16 1:48 PM: See also Silver's followup post and my response to him and other recent critics of election forecasting.
[Cross-posted to HuffPost Pollster]
* Silver's challenger ideology data are primarily derived from The Party Decides by Marty Cohen, David Karol, Hans Noel, and John Zaller. We used the exact data underlying Figure 4.1 in The Party Decides as provided by Noel. The Times presents these data on a 0-100 scale, but the underlying data are actually on a 0-7 scale based on estimated distance from the ideological center. We used Silver's challenger ideology estimates for John Kerry in 2004 and Barack Obama in 2008 but converted them from the 0-100 scale to the original Party Decides metric.
** All of these authors generously shared their data with MWH.
One quick correction: I use the preference poll and not the presidential approval rating at Labor Day.
My major problem with Silver's model (other than not seeing the data and the estimation) is that it assumes a Downsian centrist strategy is optimal. I think we know now that base voters cannot be taken for granted. There is a price to be paid for being too centrist. As irrational as it may seem, extremists in both parties will stay home if they think their party's candidate does not represent them well enough. In effect, their seemingly irrational behavior gives them more leverage in the process. Just ask President McCain. All of this should not come as a surprise since turnout itself only makes sense if it is seen as an expressive rather than instrumental act.
Posted by: Jim Campbell | November 10, 2011 at 12:49 PM
All the models are correlated with each other, since they are fitting to basically the same set of underlying data, and of course the Ensemble is correlated with all of them. When you calculated the error bars for the Ensemble did you take correlations into account?
Posted by: Amit | November 10, 2011 at 01:00 PM
Apologies to Jim Campbell - the error is corrected above.
Posted by: bnyhan | November 10, 2011 at 01:24 PM
This is the coolest thing I've read all week! I wish there were more posts!
Posted by: JP | November 10, 2011 at 03:08 PM
Thank you for the corrective.
I'd still like to see someone note that RDI is actually a good measure of the underlying concept we're interested in: is the average person likely to see themselves as better off or not? As economic inequality has grown so dramatically, GDP is increasingly a bad measure of how average people are doing.
Posted by: Matt Jarvis | November 10, 2011 at 04:02 PM
Economic conditions and poll results are objective, whereas the public perception of a candidate's extremism is subjective. So inclusion of Extremism Index leads to considerations of Media Bias.
Many conservatives think Obama's previous record before running for President showed him as extreme. They point to:
-- his voting record, though thin, ranked him as the most liberal, according to one measure.
-- his 20-year membership in an extreme church
-- his association with (literal) bomb-thrower Bill Ayers.
However, candidate Obama portrayed himself as moderate and the mainstream media pretty much went along with pooh-poohing these three facts.
OTOH Sarah Palin had a fairly moderate record, again quite thin. Silver doesn't show her extremism index (since she was only running for VP), but I would guess it was higher than Obama's because she got such bad press.
Al Gore was another candidate whose unkind media treatment was likely the key to his loss.
My conclusion is that replacing Extremism Index with a Media Bias Index might make the models more accurate.
(However, liberal academics don't like to admit a pro-liberal media bias. They don't see a liberal POV as biased; they see it as correct. It probably wouldn't help Brendan's career to emphasize liberal bias.)
Posted by: David in Cal | November 11, 2011 at 09:54 AM
Excellent article pointing out the limitations in election prediction models. The aptly named author, Sean Trende, lucidly explains ten assumptions implicit in all these models. He considers all of these assumptions questionable.
Trende argues, as Nate Silver recently argued, that one should not be too certain about the validity of any of these models.
Posted by: David in Cal | November 12, 2011 at 12:21 AM
@David in Cal, true that 'liberal academics don't like to admit to a pro-liberal media bias'. On the other hand conservatives do not like to admit a pro-conservative media bias in News Corp holdings either. People of all political beliefs want to read/watch news that reinforces their already held political ideology and MSNBC/Fox News are there to fill that demand.
For fun check out this old Herman Cain clip singing Imagine There's no Pizza. Any analysis that does not include these singing abilities is missing the big picture.
http://www.youtube.com/watch?v=-DrSEyjBj1w
Posted by: JP | November 13, 2011 at 04:12 PM
What a great clip, JP! Thanks for the link.
You can tell that Cain has had some voice training or coaching. On words ending with an "er" sound, he properly sings "uh", so as not to get stuck in the "rrr". And, with diphthongs he holds the first vowel sound as long as possible. E.g., with a long i "ah-ee", he holds the "ah" through most of the note and only goes to "ee" at the very end.
Posted by: David in Cal | November 13, 2011 at 08:06 PM
I can't help but notice that in your chart comparing the 7 prediction models over 9 election cycles (for 63 total predictions), the result fell outside of the 90% confidence interval 11 times, or 11/63 = 17%... so Silver's fundamental critique that the modellers underestimate their error bars seems fair.
Although Silver's model falls outside of the 90% confidence interval twice (2/9 = 22%), even with his more generous error bars. :)
Posted by: Jacob | November 16, 2011 at 11:18 PM