February 11, 2008

Where Obama is winning and losing

There’s been some debate among pundits about where Barack Obama has been successful and why. To try to make some sense of what’s going on, I decided to actually look at the data. (My pundit card will soon be revoked.)

One issue is how to compare across states given the change in the number of candidates running over time. The method I used is to focus on how well Obama did relative to Hillary Clinton by looking at the proportion of their total vote that Obama received, which (a) attempts to adjust for the departure of John Edwards and (b) contains much more information than simple win/loss tallies. (I also excluded the home states of Illinois, New York, and Arkansas and the largely uncontested states of Florida and Michigan from the analyses below.)
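As a minimal sketch of the measure described above, here is the two-candidate share computed for a single contest. The function name and the vote figures are hypothetical and only illustrate the arithmetic; they are not taken from the actual returns.

```python
def two_candidate_share(obama_votes, clinton_votes):
    """Obama's share of the combined Obama + Clinton vote."""
    total = obama_votes + clinton_votes
    if total == 0:
        raise ValueError("no votes recorded for either candidate")
    return obama_votes / total


# Hypothetical state: Obama 45, Clinton 40, Edwards 15 (percent of all votes).
# Edwards's votes drop out of the denominator entirely.
share = two_candidate_share(45, 40)
print(f"{share:.3f}")
```

Because a third candidate's votes never enter the calculation, states that voted before and after Edwards left the race are put on the same footing.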

When you focus on Obama’s proportion of the two-candidate vote, it’s striking how he’s run up huge margins in so many of his wins but his losses have almost all been relatively narrow:

Obama has won nine states with more than 60 percent of the two-candidate vote and three states with more than 70 percent, but he has received less than 40 percent of the two-candidate vote only once.

The first question is whether Obama is doing as well in caucuses as it appears. The answer is yes:

Weighting states equally, he’s received an average of 66 percent of the two-candidate vote in caucuses and only 51 percent in primaries. Why? Kevin Drum’s readers suggest the following explanations, which seem plausible, though the data can’t really arbitrate between them:

Caucuses require organization and Obama was better organized. They require enthusiasm and he has more enthusiastic supporters. They require time, and his demographic has more free time. They’re mostly in small states, and Obama targeted small states. They’re dominated by activists, and activists tend to support Obama.
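For readers wondering what "weighting states equally" means for the caucus and primary averages above, here is a rough sketch. The four contests and their vote counts are made up for illustration; the point is only the difference between averaging state-level shares and pooling the votes.

```python
from statistics import mean

# Hypothetical contests: (state, contest type, Obama votes, Clinton votes).
contests = [
    ("A", "caucus", 7_000, 3_000),
    ("B", "caucus", 13_000, 7_000),
    ("C", "primary", 450_000, 550_000),
    ("D", "primary", 1_100_000, 900_000),
]


def equal_weight_average(kind):
    """Average the state-level shares, counting every state once."""
    shares = [o / (o + c) for _, k, o, c in contests if k == kind]
    return mean(shares)


def pooled_share(kind):
    """Pool the votes first, so that large states count for more."""
    obama = sum(o for _, k, o, _ in contests if k == kind)
    total = sum(o + c for _, k, o, c in contests if k == kind)
    return obama / total


caucus_equal = equal_weight_average("caucus")
primary_equal = equal_weight_average("primary")
primary_pooled = pooled_share("primary")
print(caucus_equal, primary_equal, primary_pooled)
```

The equal-weight version treats a small caucus state and a large primary state as one observation each, which is the comparison reported above.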

Another issue is how Obama’s performance has varied according to the racial composition of the states in question. Despite my worries about a possible ceiling in white support, Obama has done well both in states with high black populations and heavily white “red states” (as Matthew Yglesias and other commentators have noted). The data indicate that this pattern, which is plotted below using a quadratic fit, appears to hold up across the full set of primaries and caucuses to date:

One claim I’ve seen thrown around to explain this pattern is the existence of racial threat. According to this story, Obama’s race isn’t an issue in overwhelmingly white states because race isn’t salient there, whereas Obama can win in states with large black populations using a coalition built on black support. But in states with moderate black populations, race is sufficiently salient to reduce his vote totals among whites and he can’t ride the black vote to victory in the same way as he does in more heavily black states. I’m not sure if that’s true, but the data are at least broadly consistent with the story.

Another pattern observed in exit polls is that Obama has not done as well as Hillary Clinton among Hispanics. At the aggregate level, the data do show that he’s done worse in states with larger Hispanic populations, though the association doesn’t seem to be particularly strong:

Finally, Marc Ambinder of The Atlantic recently claimed that Hillary “can’t win the small states (unless she controls the machine — think Nevada)” while “Obama cannot win the states where the majority of Democrats reside.” But as Yglesias argued, this claim seems to depend heavily on California:

This seems like a mighty gerrymandered “can’t” for Obama. He can win Democratic states like Washington, Connecticut, and Delaware. He can win states the Democrats sometimes carry like Iowa and Missouri. Is the criticism that Obama can’t win big heavily Democratic states? Well, he won his home state of Illinois and Clinton won her home state of New York. So this amounts to saying Obama lost California. Which, of course, he did. And it’s a big state so California gets a lot of delegates. But one can hardly proclaim the winner of California the winner on some “states where the majority of Democrats reside” theory when Obama’s winning more states and winning more delegates and winning them in all regions of the country.

Let’s take a closer look. First, here is Obama’s vote plotted against the log of state population (graphing by raw population is useless because California is so much larger than the other states):

As you can see, he has generally done better among smaller states, as Ambinder observed.

Turning to Ambinder’s second claim, we can look at the Obama vote relative to the Democratic presidential vote in the 2004 election:

Again, we see that Obama has generally done better in the least Democratic states.

So is Ambinder right? One last way to assess the claim is to look at how Obama’s vote varies with log(state population*Democratic presidential vote), where the product roughly approximates the number of Democratic voters by state:

Once again, Obama appears to do worse in states that are larger and more Democratic. The question is why. One possible explanation is that it is harder (i.e., more expensive and time-consuming) for him to reach base voters in those states and move them off their default preference for Hillary. By contrast, smaller and less heavily Democratic states have fewer caucus and primary voters, so he can reach them more effectively. Another possibility is that Hillary’s elite support is stronger in larger and more Democratic states, whereas Obama has greater support from “red state” Democratic politicians who are concerned about Hillary’s performance in the general election.

It’s hard to separate the associations between these variables because larger states are (on average) more black and Hispanic, more Democratic, and less likely to have caucuses. But when we put all these factors together in a linear regression (including both black population and black population squared), we find that the U-shaped quadratic relationship for black population and the positive relationship for caucuses are statistically significant, while the other factors are not. In other words, the evidence so far is consistent with the conventional wisdom that Obama does best in heavily black and heavily white states and in caucuses and he does less well in moderately black states and primaries.
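To make the specification concrete, here is a rough sketch of a regression with black population, black population squared, and a caucus indicator. Everything in it is an assumption for illustration: the eight "states" are invented, the outcome is generated without noise from made-up coefficients so the fit can be checked, and the helper functions are not the software used for the analysis above. It also computes no standard errors, so it says nothing about statistical significance.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's lists are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical states: (black population share, 1 if caucus else 0).
states = [(0.01, 1), (0.03, 0), (0.06, 1), (0.10, 0),
          (0.15, 0), (0.20, 1), (0.26, 0), (0.30, 0)]

# Made-up, noise-free outcome: 0.55 - 1.5*black + 5.0*black^2 + 0.12*caucus.
y = [0.55 - 1.5 * b + 5.0 * b * b + 0.12 * c for b, c in states]

# Design matrix: intercept, black share, black share squared, caucus dummy.
X = [[1.0, b, b * b, float(c)] for b, c in states]
intercept, b_black, b_black_sq, b_caucus = ols(X, y)

# A negative linear term with a positive squared term is a U shape;
# the bottom of the U sits at -b_black / (2 * b_black_sq).
turning_point = -b_black / (2 * b_black_sq)
print(b_black, b_black_sq, b_caucus, turning_point)
```

With real data the coefficients would come with sampling error, and a statistics package would report the standard errors needed for the hypothesis tests discussed here.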

[Disclaimers: This is all just a rough cut at early aggregate voting data. We only have 30 observations so far. Finally, we can’t make direct inferences about individual behavior based on aggregate data.]

Update 2/11 12:06 PM: Kevin Drum links and asks an additional question:

I’d add a caveat to this. Brendan actually finds that all five pieces of CW are true, but that the last three aren’t statistically significant. In other words, there’s at least a 5% possibility that they might be the result of chance.

But this is a one shot deal, and I wonder if the results are significant at, say, a 90% level? In an academic setting this wouldn’t be good enough, but in a real-life setting where this is the only data you have (no followup studies, folks!), most people would probably think that 90% certainty was fairly convincing. For better or worse, it looks to me like the CW is likely true on all five counts.

To answer the question, the other variables aren’t close to being significant. However, I wouldn’t put too much stock in the results of any of these hypothesis tests because (a) hypothesis testing is riddled with epistemological problems and (b) it’s difficult to achieve significance in small samples.

Update 2/11 3:50 PM: Josh Patashnik at TNR flags a more elaborate regression model predicting the Obama vote by the Daily Kos blogger poblano, who states that he “looked at pretty much every variable [I] could think of that we can quantify about a state and that might affect the Obama-Clinton vote share” before putting together a model with nine variables. However, as Patashnik notes, the performance of the model over the weekend was “only so-so”:

Poblano at Daily Kos has done a great job putting together a regression predicting Obama’s share of the vote in each state. I’m not totally sold on it–it performs very well for the states that voted prior to when the model was constructed (which it obviously should, given that that’s how the parameters were chosen in the first place), but did only so-so for this weekend’s states (overestimating Obama’s support in Louisiana and Nebraska, and underestimating it in Washington and Maine).

This is what’s known as “overfitting” and it’s the reason I didn’t make predictions for upcoming primaries based on the regression model I discuss above. The problem is that model performance usually deteriorates dramatically when you make predictions out of sample (i.e. for new data). Poblano’s search for explanatory variables is likely to make this problem worse.
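Here is a deliberately extreme toy illustration of the out-of-sample problem. The data are made up, and the flexible model is a polynomial with one parameter per training point, which is far more extreme than any nine-variable regression. It is a sketch of the general phenomenon, not a reconstruction of poblano's model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a prediction function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial passing through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mean_abs_error(model, xs, ys):
    """Average absolute gap between predictions and observed values."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Made-up data that roughly follow y = x with a little noise.
train_x, train_y = [0, 1, 2, 3, 4, 5], [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
test_x, test_y = [6, 7], [6.1, 6.9]  # "new" observations held out of the fit

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

for name, model in [("line", line), ("polynomial", poly)]:
    print(name,
          round(mean_abs_error(model, train_x, train_y), 3),
          round(mean_abs_error(model, test_x, test_y), 3))
```

The polynomial fits the training points perfectly because it has enough parameters to chase the noise, and that is exactly why it does badly on the held-out points.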

Bill Bishop at the Daily Yonder (formerly of the Austin American-Statesman) also passes along two graphs showing that at the county level (rather than the state level) Obama actually did better in more Democratic counties in both California and Missouri:

The lesson is that the answers we get depend, in part, on the level of aggregation we consider. Remember the Gelman et al. study of the relationship between income and party, which finds that the association varies by state income (PDF). In the poorest states, income is closely related to party affiliation, but the relationship weakens as state income increases. It’s possible that something similar is going on here.
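To see how the sign of a relationship can flip with the level of aggregation, consider this toy example. The two states and their counties are invented, and the numbers are chosen only so that the county-level and state-level slopes disagree; they are not Bishop's data.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))


# Hypothetical counties: (Democratic vote share, Obama two-candidate share).
state_a = [(0.55, 0.40), (0.60, 0.45), (0.65, 0.50)]  # more Democratic state
state_b = [(0.35, 0.55), (0.40, 0.60), (0.45, 0.65)]  # less Democratic state


def county_slope(counties):
    """Slope across the counties within a single state."""
    return slope([d for d, _ in counties], [o for _, o in counties])


def state_average(counties):
    """Collapse a state's counties into one (Democratic, Obama) point."""
    n = len(counties)
    return (sum(d for d, _ in counties) / n, sum(o for _, o in counties) / n)


within_a = county_slope(state_a)
within_b = county_slope(state_b)

# Between-state slope, using one averaged point per state.
averages = [state_average(state_a), state_average(state_b)]
between = slope([d for d, _ in averages], [o for _, o in averages])
print(within_a, within_b, between)
```

Within each state the more Democratic counties are better for Obama, yet the more Democratic state is worse for him overall, so both the state-level and the county-level graphs can be right at once.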

Finally, per Roger Ford’s comment below, I pulled all the available exit poll data to look at how the white vote for Obama varies with the black population in the state, which is a more direct measure of the racial threat hypothesis above. Here’s the graph of interest plotted with both a linear and a quadratic fit:

The quadratic relationship is statistically significant in a regression including the other factors listed above, though I’m not sure why white support for Obama would increase in heavily black states relative to moderately black states. (Whites there are more comfortable with minority elected leadership?) The linear relationship isn’t statistically significant, but see above for the appropriate caveats about hypothesis testing.

Update 2/11 4:38 PM: The graph of population*Democratic presidential vote and the discussion of it above have been updated to correct an error caught by TerryVB. (Specifically, I switched the X-axis from log(pop)*presidential vote to log(pop*presidential vote).)
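For anyone unsure why the two X-axes differ, here is a quick sketch with a hypothetical state (the population and vote share are made up). The log of a product is the sum of the logs, which is not the same thing as multiplying a log by the vote share.

```python
import math

# Hypothetical state: 5 million residents, 48% Democratic vote in 2004.
population = 5_000_000
dem_share = 0.48

# Original (incorrect) X-axis: log(pop) * presidential vote.
scaled_log = math.log(population) * dem_share

# Corrected X-axis: log(pop * presidential vote), i.e. the log of the
# approximate number of Democratic voters.
log_of_product = math.log(population * dem_share)

# The log of a product is a sum of logs, not a product.
sum_of_logs = math.log(population) + math.log(dem_share)

print(scaled_log, log_of_product, sum_of_logs)
```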

Update 2/11 9:52 PM: To try to understand variation in the white vote, I tried poblano’s idea of using Southern Baptist population as a continuous variable that can proxy for “Southernness” (as suggested by IKL in comments below). And indeed the relationship between Southern Baptist population and Obama’s white support is striking (and statistically significant):

Once you account for this variable, the relationship between black population and Obama’s support among whites vanishes.

Update 2/12 10:06 AM: Plots of state education and income are inconclusive in bivariate form, though education is positive and statistically significant in a multiple regression as poblano notes:

The reason, I’m guessing, is that the education-Obama relationship only shows up once you control for Democratic presidential vote. If you disaggregate by states Kerry and Bush won, it seems to be positive for “red states” and negative for “blue states”:

Update 2/13 9:49 AM: Pollster.com’s Mark Blumenthal provides an accessible overview of some of the pros and cons of regression analysis for those who are not familiar with it. But it’s worth noting one additional limitation. He links to another regression post by Jay Cost at Real Clear Politics which finds that Hillary Clinton does better in states she visits more. Cost suggests that this means her visits are effective in increasing support. While that might be true, it is also possible that Clinton is visiting states where she has more support (or where her support is increasing). This problem is known as endogeneity, and a standard regression can’t address it directly (different approaches are usually required). The general issue is that regression by itself tells us nothing about causality; it can only tell us about associations between variables.
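A toy example shows how this can happen. Everything here is assumed for illustration: the support figures are invented, the scheduling rule (more visits where support is already higher) is hypothetical, and the true effect of a visit is set to zero by construction. It makes no claim about what actually drove the Clinton campaign's schedule.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))


# Hypothetical baseline support for a candidate in eight states.
baseline_support = [0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70]

# Assumption: the campaign schedules more visits where support is already high.
visits = [round(20 * s) for s in baseline_support]

# Assumption: visits change nothing, so the final vote equals the baseline.
TRUE_VISIT_EFFECT = 0.0
final_vote = [s + TRUE_VISIT_EFFECT * v
              for s, v in zip(baseline_support, visits)]

# Regressing the vote on visits still produces a positive slope.
estimated_effect = slope(visits, final_vote)
print(estimated_effect)
```

The regression picks up the scheduling rule rather than any effect of the visits, which is why a positive coefficient alone can't distinguish the two stories.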

Update 2/14 12:42 PM: techne and other commenters argue that Obama’s support in heavily white states is driven by the public nature of caucuses, which is possible. As such, I’ve made a new version of the graph showing Obama support by black population with a linear trend fit only to the primary states. This version is much more dramatic as a result of including Washington, DC (it also includes VA and MD):

Also, Cost makes a thoughtful case for why primary/caucus campaign visits might not be endogenous in a comment below.