There's been some debate among pundits about where Barack Obama has been successful and why. To try to make some sense of what's going on, I decided to actually look at the data. (My pundit card will soon be revoked.)
One issue is how to compare across states given the change in the number of candidates running over time. The method I used is to focus on how well Obama did relative to Hillary Clinton by looking at the proportion of their total vote that Obama received, which (a) attempts to adjust for the departure of John Edwards and (b) contains much more information than simple win/loss tallies. (I also excluded the home states of Illinois, New York, and Arkansas and the largely uncontested states of Florida and Michigan from the analyses below.)
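For readers who want to see the arithmetic, here is a minimal sketch of the two-candidate share. The function name and the vote counts are made up for illustration; they are not the actual returns from any state.

```python
def two_candidate_share(obama_votes, clinton_votes):
    """Obama's share of the combined Obama + Clinton vote, ignoring other candidates."""
    total = obama_votes + clinton_votes
    if total <= 0:
        raise ValueError("combined vote must be positive")
    return obama_votes / total


# Hypothetical state: Obama 450 votes, Clinton 400, Edwards 150 (made-up numbers).
share = two_candidate_share(450, 400)
print(round(share, 3))  # 450 / 850, prints 0.529
```

In this made-up case Obama's raw share is 45 percent, but his two-candidate share is about 53 percent, so the measure does not change when a third candidate's votes disappear.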
When you focus on Obama's proportion of the two-candidate vote, it's striking how he's run up huge margins in so many of his wins but his losses have almost all been relatively narrow:
Obama has won nine states with more than 60 percent of the two-candidate vote and three states with more than 70 percent, but he's only received less than 40 percent of the two-candidate vote once.
The first question is whether Obama is doing as well in caucuses as it appears. The answer is yes:
Weighting states equally, he's received an average of 66 percent of the two-candidate vote in caucuses and only 51 percent in primaries. Why? Kevin Drum's readers suggest the following explanations, which seem plausible, though the data can't really arbitrate between them:
Caucuses require organization and Obama was better organized. They require enthusiasm and he has more enthusiastic supporters. They require time, and his demographic has more free time. They're mostly in small states, and Obama targeted small states. They're dominated by activists, and activists tend to support Obama.
Another issue is how Obama's performance has varied according to the racial composition of the states in question. Despite my worries about a possible ceiling in white support, Obama has done well both in states with high black populations and heavily white "red states" (as Matthew Yglesias and other commentators have noted). The data indicate that this pattern, which is plotted below using a quadratic fit, appears to hold up across the full set of primaries and caucuses to date:
One claim I've seen thrown around to explain this pattern is the existence of racial threat. According to this story, Obama's race isn't an issue in overwhelmingly white states because race isn't salient there, whereas Obama can win in states with large black populations using a coalition built on black support. But in states with moderate black populations, race is sufficiently salient to reduce his vote totals among whites and he can't ride the black vote to victory in the same way as he does in more heavily black states. I'm not sure if that's true, but the data are at least broadly consistent with the story.
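As a rough illustration of what a quadratic fit like the one mentioned above involves, here is a sketch using NumPy's `polyfit`. The numbers are invented and constructed to lie exactly on a U-shaped curve so the result is easy to check; they are not the state data, and the variable names are assumptions.

```python
import numpy as np

# Made-up values that lie exactly on share = 0.0008 * (x - 15)^2 + 0.45.
# These are NOT the actual state-level data.
black_pct = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 30.0])
obama_share = 0.0008 * (black_pct - 15.0) ** 2 + 0.45

# Least-squares fit of share = c2*x^2 + c1*x + c0.
# polyfit returns coefficients from the highest degree down.
c2, c1, c0 = np.polyfit(black_pct, obama_share, deg=2)

# A U shape means a positive squared term; the curve bottoms out at -c1 / (2*c2).
turning_point = -c1 / (2 * c2)
print(f"squared term: {c2:.4f}, minimum near {turning_point:.1f}% black population")
```

With real data the points would scatter around the curve rather than sit on it, and the location of the minimum would be estimated with considerable uncertainty.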
Another pattern observed in exit polls is that Obama has not done as well as Hillary Clinton among Hispanics. At the aggregate level, the data do show that he's done worse in states with larger Hispanic populations, though the association doesn't seem to be particularly strong:
Finally, Marc Ambinder of The Atlantic recently claimed that Hillary "can't win the small states (unless she controls the machine -- think Nevada)" while "Obama cannot win the states where the majority of Democrats reside." But as Yglesias argued, this claim seems to depend heavily on California:
This seems like a mighty gerrymandered "can't" for Obama. He can win Democratic states like Washington, Connecticut, and Delaware. He can win states the Democrats sometimes carry like Iowa and Missouri. Is the criticism that Obama can't win big heavily Democratic states? Well, he won his home state of Illinois and Clinton won her home state of New York. So this amounts to saying Obama lost California. Which, of course, he did. And it's a big state so California gets a lot of delegates. But one can hardly proclaim the winner of California the winner on some "states where the majority of Democrats reside" theory when Obama's winning more states and winning more delegates and winning them in all regions of the country.
Let's take a closer look. First, here is Obama's vote plotted against the log of state population (graphing by raw population is useless because California is so much larger than the other states):
As you can see, he has generally done better among smaller states, as Ambinder observed.
Turning to Ambinder's second claim, we can look at the Obama vote relative to the Democratic presidential vote in the 2004 election:
Again, we see that Obama has generally done better in the least Democratic states.
So is Ambinder right? One last way to assess the claim is to look at how Obama's vote varies with the log of state population*Democratic presidential vote, which roughly approximates the number of Democratic voters by state:
Once again, Obama appears to do worse in states that are larger and more Democratic. The question is why. One possible explanation is that it is harder (i.e. more expensive and time-consuming) for him to reach base voters in those states to move them off their default preference for Hillary. By contrast, in smaller and less heavily Democratic states, there are fewer caucus and primary voters, so he can reach them more effectively. Another possibility is that Hillary's elite support is stronger in larger and more Democratic states, whereas Obama has greater support from "red state" Democratic politicians who are concerned about Hillary's performance in the general election.
It's hard to separate the associations between these variables because larger states are (on average) more black and Hispanic, more Democratic, and less likely to have caucuses. But when we put all these factors together in a linear regression (including both black population and black population squared), we find that the U-shaped quadratic relationship for black population and the positive relationship for caucuses are statistically significant, while the other factors are not. In other words, the evidence so far is consistent with the conventional wisdom that Obama does best in heavily black and heavily white states and in caucuses and he does less well in moderately black states and primaries.
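Here is a sketch of how a regression of this general form can be set up, using only a subset of the factors (black population, its square, and a caucus indicator). The data are invented and generated from known coefficients so the output can be checked; nothing here reproduces the actual estimates, and the sketch does not compute standard errors or significance tests.

```python
import numpy as np

# Made-up predictors for ten hypothetical states (NOT the real data).
black_pct = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 12.0, 15.0, 20.0, 26.0, 30.0])
caucus = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])

# Known coefficients used to build the outcome: intercept, linear, squared, caucus.
true_beta = np.array([0.60, -0.020, 0.0006, 0.12])

# Design matrix: intercept, black population, its square, caucus indicator.
X = np.column_stack([np.ones_like(black_pct), black_pct, black_pct**2, caucus])
obama_share = X @ true_beta  # noiseless by construction

# Ordinary least squares via NumPy's least-squares solver.
beta_hat, *_ = np.linalg.lstsq(X, obama_share, rcond=None)
for name, b in zip(["intercept", "black_pct", "black_pct^2", "caucus"], beta_hat):
    print(f"{name:>12}: {b: .4f}")
```

A negative linear term with a positive squared term is what a U-shaped relationship looks like in the coefficients, and a positive caucus coefficient corresponds to doing better in caucuses holding black population fixed.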
[Disclaimers: This is all just a rough cut at early aggregate voting data. We only have 30 observations so far. Finally, we can't make direct inferences about individual behavior based on aggregate data.]
Update 2/11 12:06 PM: Kevin Drum links and asks an additional question:
I'd add a caveat to this. Brendan actually finds that all five pieces of CW are true, but that the last three aren't statistically significant. In other words, there's at least a 5% possibility that they might be the result of chance.
But this is a one shot deal, and I wonder if the results are significant at, say, a 90% level? In an academic setting this wouldn't be good enough, but in a real-life setting where this is the only data you have (no followup studies, folks!), most people would probably think that 90% certainty was fairly convincing. For better or worse, it looks to me like the CW is likely true on all five counts.
To answer the question, the other variables aren't close to being significant. However, I wouldn't put too much stock in the results of any of these hypothesis tests because (a) hypothesis testing is riddled with epistemological problems and (b) it's difficult to achieve significance in small samples.
Update 2/11 3:50 PM: Josh Patashnik at TNR flags a more elaborate regression model predicting the Obama vote by the Daily Kos blogger poblano, who states that he "looked at pretty much every variable could think of that we can quantify about a state and that might affect the Obama-Clinton vote share" before putting together a model with nine variables. However, as Patashnik notes, the performance of the model over the weekend was "only so-so":
Poblano at Daily Kos has done a great job putting together a regression predicting Obama's share of the vote in each state. I'm not totally sold on it--it performs very well for the states that voted prior to when the model was constructed (which it obviously should, given that that's how the parameters were chosen in the first place), but did only so-so for this weekend's states (overestimating Obama's support in Louisiana and Nebraska, and underestimating it in Washington and Maine).
This is what's known as "overfitting" and it's the reason I didn't make predictions for upcoming primaries based on the regression model I discuss above. The problem is that model performance usually deteriorates dramatically when you make predictions out of sample (i.e. for new data). Poblano's search for explanatory variables is likely to make this problem worse.
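To make the in-sample versus out-of-sample distinction concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the data are random numbers, one predictor truly matters, eight are pure noise, and the 22/8 split is arbitrary. It is not poblano's model or the regression above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_train, n_test, n_noise = 22, 8, 8
n = n_train + n_test

x = rng.normal(size=n)                       # one predictor that really matters
noise_vars = rng.normal(size=(n, n_noise))   # eight predictors that do not
y = 0.5 + 0.1 * x + rng.normal(scale=0.1, size=n)


def design(*cols):
    """Stack an intercept column with the given predictor columns."""
    return np.column_stack([np.ones(n), *cols])


def mse(X, y, beta):
    """Mean squared prediction error."""
    return float(np.mean((y - X @ beta) ** 2))


train, test = slice(0, n_train), slice(n_train, n)
results = {}
for name, X in [("1 variable", design(x)), ("9 variables", design(x, noise_vars))]:
    # Fit on the "early" states only, then score on both sets.
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    results[name] = (mse(X[train], y[train], beta), mse(X[test], y[test], beta))
    print(f"{name}: in-sample MSE {results[name][0]:.4f}, "
          f"out-of-sample MSE {results[name][1]:.4f}")
```

The in-sample error of the larger model can never be worse than that of the smaller one, because the smaller model is a special case of it. The out-of-sample error carries no such guarantee, which is why a good fit to states that have already voted says little about predictive accuracy.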
Bill Bishop at the Daily Yonder (formerly of the Austin American-Statesman) also passes along two graphs showing that at the county level (rather than the state level) Obama actually did better in more Democratic counties in both California and Missouri:
The lesson is that the answers we get depend, in part, on the level of aggregation we consider. Remember the Gelman et al. study of the relationship between income and party, which finds that the association varies by state income (PDF). In the poorest states, income is closely related to party affiliation, but the relationship weakens as state income increases. It's possible that something similar is going on here.
Finally, per Roger Ford's comment below, I pulled all the available exit poll data to look at how the white vote for Obama varies with the black population in the state, which is a more direct measure of the racial threat hypothesis above. Here's the graph of interest plotted with both a linear and a quadratic fit:
The quadratic relationship is statistically significant in a regression including the other factors listed above, though I'm not sure why white support for Obama would increase in heavily black states relative to moderately black states. (Whites there are more comfortable with minority elected leadership?) The linear relationship isn't statistically significant, but see above for the appropriate caveats about hypothesis testing.
Update 2/11 4:38 PM: The graph of population*Democratic presidential vote and the discussion of it above have been updated to correct an error caught by TerryVB. (Specifically, I switched the X-axis from log(pop)*presidential vote to log(pop*presidential vote).)
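The difference between the two X-axis definitions can be checked numerically. This is only a sketch: the function names and the example state are hypothetical, and it assumes natural logs (the base does not affect the point).

```python
import math


def log_dem_voters(population, dem_share):
    """log(P * D): log of the approximate number of Democratic voters."""
    return math.log(population * dem_share)


def log_pop_times_share(population, dem_share):
    """log(P) * D: the earlier, mistaken x-axis, which equals log(P ** D)."""
    return math.log(population) * dem_share


# Hypothetical state: 5 million people, 50% Democratic presidential vote.
P, D = 5_000_000, 0.5

print(round(log_dem_voters(P, D), 3))       # equals log(P) + log(D)
print(round(log_pop_times_share(P, D), 3))  # equals log(P ** D)
```

The first quantity adds log(D) to log(P), so it moves with the number of Democratic voters; the second scales log(P) by D, which is a different quantity altogether.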
Update 2/11 9:52 PM: To try to understand variation in the white vote, I tried poblano's idea of using Southern Baptist population as a continuous variable that can proxy for "Southernness" (as suggested by IKL in comments below). And indeed the relationship between Southern Baptist population and Obama's white support is striking (and statistically significant):
Once you account for this variable, the relationship between black population and Obama's support among whites vanishes.
Update 2/12 10:06 AM: Plots of state education and income are inconclusive in bivariate form, though education is positive and statistically significant in a multiple regression as poblano notes:
The reason, I'm guessing, is that the education-Obama relationship only shows up once you control for Democratic presidential vote. If you disaggregate by states Kerry and Bush won, it seems to be positive for "red states" and negative for "blue states":
Update 2/13 9:49 AM: Pollster.com's Mark Blumenthal provides an accessible overview of some of the pros and cons of regression analysis for those who are not familiar with it. But it's worth noting one additional limitation. He links to another regression post by Jay Cost at Real Clear Politics which finds that Hillary Clinton does better in states she visits more. Cost suggests that this means her visits are effective in increasing support. While that might be true, it is also possible that Clinton is visiting states where she has more support (or where her support is increasing). Standard regression can't directly handle this problem, which is known as endogeneity (different approaches are usually required). The general issue is that regression by itself tells us nothing about causality; it can only tell us about possible associations between variables.
Update 2/14 12:42 PM: techne and other commenters argue that Obama's support in heavily white states is driven by the public nature of caucuses, which is possible. As such, I've made a new version of the graph showing Obama support by black population with a linear trend fit only to the primary states. This version is much more dramatic as a result of including Washington, DC (it also includes VA and MD):
Also, Cost makes a thoughtful case for why primary/caucus campaign visits might not be endogenous in a comment below.
It looks like you're missing the data point for New York.
Posted by: tom veil | February 11, 2008 at 02:17 PM
I wonder if there's a correlation between Obama's share of the white vote in a state and the share of the state's population that is black....
Posted by: Roger Ford | February 11, 2008 at 02:37 PM
It would be great if you'd include the r-squared values for those regressions. It would also be pretty interesting to see a multiple regression on all the factors.
Posted by: TW Andrews | February 11, 2008 at 02:43 PM
The Obama support by black population graph is interesting. Its quadratic fit is a function of two competing trends. In primaries, the blacker the state, the more votes Obama gets. In caucuses, the reverse seems to be true. My guess is that in caucuses, the more blacks are standing in Obama's corner, the less comfortable whites are supporting him. In the privacy of the polling booth, though, things are different. That's an interesting result.
Posted by: dan | February 11, 2008 at 04:12 PM
California is an outlier on so many dimensions that it's going to skew any analysis done that includes us. There are many reasons why Clinton won so handily here, most of which aren't covered at all above (early voting, the unique media markets in this state, the importance of particular power brokers in certain constituencies, etc.).
More importantly, as others have pointed out, the conclusions above are only relevant to the general election if you think there's a significant chance that Obama would lose states like California or New York to McCain.
Posted by: DarrenG | February 11, 2008 at 04:15 PM
I'm not so sure about your choice of "Log(population) * Democratic Presidential Vote" as the x coordinate of your last graph. Let's say P is the state's population, and D is the proportion of Democratic votes in 2004. Then if you want to approximate the number of Democratic voters by state, you want P * D. But, as you say, this puts California too far ahead for comparison. So if you do the same trick you did for the 5th graph, you should be using Log(P * D) = Log(P) + Log(D), not Log(P) * D. What you've got there is Log(P^D).
Posted by: TerryVB | February 11, 2008 at 04:18 PM
Dan makes an interesting point. The phenomenon of Obama doing best in states with the smallest and largest black percentages of population has been discussed widely, and Brendan's curve shows that. But dividing the results between primary states and caucus states seems to yield a different result. Just eyeballing the data in Brendan's graph, it appears that if only primary states are considered, the result would be a direct correlation between percentage of black population in the state and Obama's margin. Only Utah seems to be an outlier in that case. Of course there aren't many data points, and plenty of other factors are present.
Posted by: Rob | February 11, 2008 at 04:28 PM
It seems to me that the key piece of information you're forgetting is where Obama spent most of his campaign efforts.
Looking at polls over the last year, Hillary Clinton started with a considerable advantage (20-30%) both nationwide and in most individual states. Obama tended to close the gap only in the last 2 months. There is a decided uptick in his numbers just before the election in most states, with the notable exception of Florida (where the candidates agreed not to campaign.)
My assumption is that Obama focused on the initial states (IA, NH, SC, NV) and then the smaller, less costly states before Super Tuesday and his efforts paid off (I rarely saw any campaign ads in California).
Now with Super Tuesday passed, candidates can focus state-by-state again. If my assumption is true, you'll see significantly better than expected numbers for Obama in the upcoming races.
Posted by: Jinchi | February 11, 2008 at 06:45 PM
The US isn't just some mass carved up into states with different proportions of whites, blacks, and Hispanics, and Democrats and Republicans. There are pretty big regional variations in political culture. None of this analysis seems to take into account any of that. Which makes it close to useless, I think. I noted in the comments that you were making pretty unreasonable extrapolations from the behavior of whites in SC to the whole country and here you seem to be making a version of the same mistake writ large. Did you try variables for "southernness" or "westernness", for example? What about urban / rural or Catholic / Protestant? I'm guessing that "big state" is not a really useful analytic category, but any of the ones that I mentioned might be.
Posted by: ikl | February 11, 2008 at 08:44 PM
Also, the fellow on DailyKos used Southern Baptists as proxy for cultural southernness which was pretty clever given the increasingly complex political geography of the "New South" . . .
Posted by: ikl | February 11, 2008 at 08:47 PM
Bingo! Thanks Brendan! I told ya all along.
Posted by: ikl | February 12, 2008 at 01:11 AM
I'd be interested in seeing white religiosity or a broader religiousness measure per state among whites. Maybe use the state level turnout in the 2004 general from exit polls among whites who attend church regularly. Midwestern states like Wisconsin don't have a lot of Southern Baptists, but do have a lot of highly religious whites.
Posted by: Chris | February 12, 2008 at 10:15 AM
Question: What would be the delegate count if the Dems had a winner take all primary? Thanks
Posted by: | February 12, 2008 at 11:02 AM
Question: The bulk of these states voted on Super Tuesday. The difficulty of campaigning in 22 states at once in a very short window of time has me asking: what if California had been held on a Tuesday like today (Feb. 12) -- fewer states, more campaign time devoted to the state -- any difference? I guess my thought is that Obama doesn't have a "big state problem" but a "time in state" problem. One that might not be so important in the general election.
Posted by: Jade7243 | February 12, 2008 at 05:54 PM
In population genetics, we call this sort of statistical analysis "story telling". This description is supported by the independent observation that these stats will help you fall asleep faster if read to you in a gentle, parental tone.
Posted by: cgwillis | February 12, 2008 at 06:20 PM
You need to provide the statistical significance (p values) for all of your fits. Just from looking at them, I can tell you right off that you do not have acceptably low p values (p = 0.05 is a typical cut-off in the scientific community) for many of the trends you report. Without statistics, those plots amount to reading tea leaves.
Posted by: Steve Collins | February 12, 2008 at 08:36 PM
Steve, as I try to make clear above, this is just an exploratory, descriptive analysis (see here for my real political science research). And of course, as I also try to make clear, many of these (apparent) relationships are not statistically significant, which is a given with such a small sample. However, as noted above, I don't put much stock in p-values.
Posted by: Brendan Nyhan | February 12, 2008 at 09:28 PM
Another vote for r2s here. And I hear you about the p-values, but reporting them corrected for the number of hypotheses tested would put them in perspective.
Given that you show a difference (pval??) between primary and caucus support for Obama, it'd be interesting (and defensible) to split analyses up that way. For example, the first graph, Obama support by black population--I eyeball there that the primary states alone would be a decent linear fit, while the caucus states wouldn't really fit any model on their own but would cluster well above the primary line. Wouldn't that be consistent with the Bradley Effect (and the supposition that the public nature of caucuses suppresses it)?
Posted by: techne | February 13, 2008 at 02:10 AM
Brendan,
Thanks for the nice mention of my piece. On the issue of endogeneity of my "campaign effects" variable - I had a few comments.
My intention in including the variable was to differentiate the early and the big 2/5 states - Iowa, New Hampshire, South Carolina, Nevada, California - from the remaining states.
It seems to me that endogeneity would be a problem if and only if (a) candidates know a priori their standing in the states, and (b) that knowledge induces them to change their campaign itineraries. I think this is a problematic case to make in the primaries.
In the general election - it is not a problem because candidates can develop early and confident estimates of their standing in the polls because of party identification. It serves as a quickie cue - i.e. McCain knows he is not winning Vermont and Clinton knows she is not winning Utah. But in a primary campaign - without party identification operating as a cue - there are problems with assuming that this variable is endogenous.
First, most of the polling in most of these states for most of the cycle is done of voters who are not paying attention. And, indeed, most of the polls have "broken" in one direction or another late in the cycle...*after* this independent variable has already been largely formed. So, even if a candidate like Obama saw that he was "down by 20 points" in New Hampshire as late as October (or whenever) - he is not going to alter his behavior.
Second, and relatedly, the polling has been very poor for most of the cycle. Of course, it is possible that internal campaign polling is getting it right while media polling is getting it wrong. However, I doubt it. My intuition is that all polling is "screwy" because of the absence of party identification as a guidepost. This is causing these late, dramatic shifts in the polls. There are simply more voters in the primaries who do not have an easily accessible cognitive framework for making their vote choice.
Third, the Democrats allocate delegates proportionally - thus, they have an incentive to campaign even if they know they are going to lose (e.g. Obama might have known he was going to lose Massachusetts).
A good case in point of all of this is Clinton in South Carolina. She had spent a lot of time and effort in South Carolina *before* the bottom fell out. As late as the fall - Team Clinton was bragging (if you believe the Atlantic) that Bill Clinton's relationship with African American pastors would ensure victory. She learned late that this was not the case. By that point, my "campaign effects" variable was already largely formed for South Carolina. Nevertheless, she still held a few late events in the Palmetto State. Her late realization that she would lose SC might have influenced this variable at the margins - but it was already essentially populated.
To be careful, I ran a test to see if this campaign effects variable violates Gauss-Markov 4. I picked up no relationship between it and the residuals; such a relationship is the kind of thing to expect if endogeneity is causing problems with the model itself.
Best,
Jay Cost
Posted by: Jay Cost | February 13, 2008 at 02:17 PM
Thank you for these very interesting statistics and analyses. Hope you don't mind a few comments, and suggestions from a mostly qualitative cultural anthropologist.
Beware the variable “Hispanic”. It is too simplistic and bears no relationship to the Spanish-speaking populations in the US, which are very dissimilar. Voting patterns of Puerto Ricans on the East Coast tend to differ depending upon whether the voter was born on the mainland or on the island. Cuban Americans do not vote the same way as Puerto Ricans, though there are beginning to be differences in Florida between first generation Cubans and their children. There is also a difference between Cuban Americans who came in the first wave and those who came in the Mariel boatlifts. All of these groups are also demonstrating variation by age set.
A significant group in the East who are ignored are the rising number of Dominican Americans – most first and second generation, and currently the largest foreign born population in NYC. They do not relate to the part of the traditional Democratic party machine controlled by Puerto Ricans. Close attention should be paid to the recent endorsements of Obama by unions like the SEIU who have a large Spanish-speaking and youthful membership who engineered the nod.
Mexicans are different from Mexican-Americans, and Chicanos in California are not the same as Tejanos in Texas. Few demographers and quantitative researchers have begun to tease out the growing populations of Hondurans, Salvadorans, and other Central Americans who are growing in number, both in California and Texas (as well as other states).
Complicating this further are the factors of “race” and identity. East Coast Puerto Rican, Dominican and Cuban communities have a greater percentage of persons who cross-identify as both black and Hispanic – even though there are also tensions around these identifiers. Mexicans and Mexican Americans do not.
Another variable within these groups that should be examined is the religious factor. Though the conventional wisdom is usually that “Hispanics are Catholics”, the rising influence of Evangelical Christianity in many of these communities should be examined more closely.
Similar questions are raised by attempts to predict the voting of “Asians” while ignoring Japanese, Chinese, Filipino, Hawaiian, Vietnamese, or Indonesian ancestry or nativity.
Posted by: Denise Oliver-Velez | February 19, 2008 at 10:15 AM