This first article on the methodology behind HowWillAmericaVote.com will serve as an introduction to the various assumptions, limitations and rationale behind the site's inner workings. My intent is to provide an initial overview and then, at a later date, supplement specific aspects in greater detail.
The central purpose of this site is to collate polling data; this is complicated, time-consuming and sometimes inexact. This section will address what data I collect, how I acquire it, and what the site does with it.
Collation of Polling Data
Every sample-based poll has an entity that conducts it and an entity that pays for it; in some cases these entities are the same. In this context, the term "entity" is used in the abstract and encompasses any association of people, organizations or political institutions. On this site, the conductor refers to the entity responsible for the sample design, data gathering and subsequent analysis; this entity is sometimes referred to as a pollster. The sponsor provides the necessary resources to the conductor, generally through financial support. I make every effort to determine who conducted the poll, and for whom it was conducted.
When the conductor and sponsor are the same entity, I only denote the conductor on the poll page to conserve space. Although this data is cataloged and stored, there is no user interface currently provided to perform aggregate analysis of either the conductor(s) or sponsor(s) within some set of polls; this functionality is planned, but not currently implemented.
Once the conductor(s) of a poll are known, I logically want to understand how the poll was conducted. I break this down into two categories: the Communication Method by which the poll was conducted and the Interaction Type.
Communication Method:
Automated, Mail, Live, Internet, Other
Interaction Type:
Cellphone, Telephone, Landline, In Person, Interviewed, Mail, Internet, Opt-In, Other
The Communication Method can be thought of as the means by which the conductor relayed information to and from the participant. There can only be one Communication Method per poll. The Interaction Type more directly references the medium by which the communication occurred; each poll may have multiple Interaction Types. In some cases, the Communication Method implies the Interaction Type; if a poll is conducted by telephone with a pre-recorded speaker, the only legally (47 USC § 227) reachable participants are those with a landline telephone. Automated calls to cell phones are banned by the Telephone Consumer Protection Act (TCPA); individual states, such as Indiana, impose further restrictions. By the same token, an automated interaction does not necessarily imply that a telephone was used; it may be an internet-based questionnaire or some sort of text messaging system. In general, most polls are conducted using telephones and further information is usually available. If I am unable to categorize a given poll, I will contact the pollster and ask for additional information. This information is retained and may be used to categorize future results from the same pollster; I intend to publish our internal list at some point in the future.
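The two-category scheme described above can be sketched as a small data structure. This is only an illustration in Python, not the site's actual code (the site itself is written in C#). The class name PollMethodology, the function needs_review, and the rule that a poll must list at least one Interaction Type are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class CommunicationMethod(Enum):
    AUTOMATED = "Automated"
    MAIL = "Mail"
    LIVE = "Live"
    INTERNET = "Internet"
    OTHER = "Other"


class InteractionType(Enum):
    CELLPHONE = "Cellphone"
    TELEPHONE = "Telephone"
    LANDLINE = "Landline"
    IN_PERSON = "In Person"
    INTERVIEWED = "Interviewed"
    MAIL = "Mail"
    INTERNET = "Internet"
    OPT_IN = "Opt-In"
    OTHER = "Other"


@dataclass(frozen=True)
class PollMethodology:
    """Hypothetical record: one Communication Method, many Interaction Types."""
    method: CommunicationMethod   # exactly one per poll
    interactions: frozenset       # a set of InteractionType values

    def __post_init__(self):
        # Assumption for this sketch: an uncategorized poll would use OTHER
        # rather than an empty set.
        if not self.interactions:
            raise ValueError("a poll needs at least one Interaction Type")


def needs_review(poll):
    """Hypothetical check: flag automated polls that list cellphone contact.

    An automated poll is not necessarily a pre-recorded phone call (it may be
    a text messaging system), so this only flags the poll for a closer look.
    """
    return (poll.method is CommunicationMethod.AUTOMATED
            and InteractionType.CELLPHONE in poll.interactions)


robo_poll = PollMethodology(
    CommunicationMethod.AUTOMATED,
    frozenset({InteractionType.LANDLINE}),
)
print(needs_review(robo_poll))
```

The check deliberately flags rather than rejects, since the categories are a simplification and the final call is made by hand.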
This two-category representation does not accurately encompass every possible scenario; on occasion it is necessary to quantize the actual procedure of a given poll into our simplified model. When this approximation is necessary, I preserve the "better" process and provide an annotation explaining the situation. This approximation is needed very infrequently, and I can't guarantee that every instance has an annotation, especially on some of the older polls.
Most polls are conducted with a homogeneous group of participants who all meet some common set of criteria. This is generally straightforward with political polls, but there are some special exceptions. The demographics the site accounts for are:
Adults, Registered Voters, Likely Voters, Unknown
Each demographic is contained within the preceding (from left-to-right) demographic; all registered voters are adults, and all likely voters are registered. When results are provided for multiple demographics from the same question, the most specific demographic's data is preferred; this inability to reference multiple demographics in the same question is a limitation of the current implementation. Our model is able to associate these different demographic samples with separate questions, but a distinction cannot be made within the same question. Sometimes this deficiency forces us to choose between more data from a less specific demographic or less data from a more specific demographic. When presented with this situation, I generally prefer to include the additional data from the less specific demographic; sub-sample results are always preferred if relevant.
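The "most specific demographic wins" rule can be illustrated with an ordered enumeration. This is a minimal sketch in Python rather than the site's own code; the function name most_specific, the shape of the results dictionary, and the choice to rank Unknown below every other demographic are assumptions made for the example.

```python
from enum import IntEnum


class Demographic(IntEnum):
    # Higher value = more specific. Ranking UNKNOWN lowest is an assumption.
    UNKNOWN = 0
    ADULTS = 1
    REGISTERED_VOTERS = 2
    LIKELY_VOTERS = 3


def most_specific(results):
    """Pick the result reported for the most specific demographic.

    `results` maps a Demographic to whatever was reported for it
    (here, a hypothetical dict of candidate -> percentage).
    """
    if not results:
        raise ValueError("no results supplied")
    best = max(results)  # IntEnum members compare by their integer value
    return best, results[best]


# Made-up numbers, purely for illustration.
question = {
    Demographic.ADULTS: {"Candidate A": 44.0, "Candidate B": 41.0},
    Demographic.LIKELY_VOTERS: {"Candidate A": 47.0, "Candidate B": 45.0},
}
chosen, data = most_specific(question)
print(chosen.name, data)
```

The sketch covers only the default preference; the trade-off described above, where a less specific demographic is kept because it carries more data, remains a manual judgment.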
Finally, question selection and ordering. Every effort is made to determine the question ordering and the exact wording but this information is not always available. The accuracy of the ordering and wording is denoted on the poll page as part of each question. When there are multiple questions/results associated with a given election, the data with the greatest number of currently participating candidates is preferred. Our definition of "participating" includes a candidate who will either appear on the ballot or is involved in a substantial write-in campaign.
All data is manually entered and often requires tedious and time-consuming research. Frequency data, when available, is preferred over percentages. Demographic data and detailed cross-tabs also require additional time; this is, however, what differentiates us from every other poll aggregation site currently in existence. This site currently tracks two demographics (when available) for every poll: gender and party affiliation. It could theoretically support many more if the aforementioned time constraints were alleviated. All additional, unentered demographic data is nonetheless archived for future use.
Poll data comes from all sorts of places: news media, press releases, newspapers, TV, poll aggregators, blogs, etc. There are a number of resources I check on a daily basis; there are usually more new releases than time. I generally sift through each day's haul and enter the polls that provide some level of personal interest. I focus on several elections, and archive the rest; on each election's matchup page, a warning message will be displayed if I am not actively maintaining poll data for that specific election. I'd ultimately like to have current and up-to-date data for every upcoming election, but it's simply not possible given the scarcity of time.
I have a number of automated programs that index specific websites for polling results and archive the results for historical accuracy. I must still manually analyze the data, but the archive provides a snapshot in time, which is useful when entering older polls. I also use these programs to record the real-time results of various elections.
For each poll I enter, I attempt to reference the resources from which I gather my information. This is generally a website, but I also save a mirrored version in case the original resource falls into disrepair or changes. I always prefer primary sources when available, but that isn't always possible. Occasionally the format of the resource is changed when I mirror the data; this is generally done with dynamic content to ensure that the relevant data is actually preserved.
How the Math Works
Each election matchup is associated with some collection of polls; aggregate analysis is then calculated using this set of polls. A poll from one election matchup does not in any way affect or influence a poll from another matchup. For example, no implicit relationship exists between a poll conducted in Minnesota and one in Wisconsin; while the states may be geographically and demographically similar, no external information beyond that contained within the aforementioned collection of polls is currently used in any calculation. There is no complicated mathematical model; exit polls are not incorporated, and historical results and every other piece of contemporary information are ignored. The poll data stands alone with its own original biases, for better or worse.
The objective is to compile the most detailed collection of polling data, and then provide the user the capacity to do with it what they choose. HowWillAmericaVote.com currently offers two regression techniques with several user-customizable fields: a simple polynomial least squares regression (LSQR) and a more complex local regression (LOC). These are basically goodness-of-fit indicators that provide good trendlines and a confidence estimate. I'll describe each of the options for these two techniques briefly below; I intend to supplement this information in a more rigorous fashion at some point in the future.
Polynomial: This is a basic vanilla implementation of least squares with the variance calculated as Residual Sum of Squares / Degrees of Freedom.
Degree: This is the degree of the polynomial used for the approximation and essentially defines the curviness of the resulting line. If there are bigger swings in the polling data, a higher degree polynomial will better account for these changes.
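As a rough illustration of the Polynomial option, the sketch below fits a polynomial by ordinary least squares and computes the variance as Residual Sum of Squares / Degrees of Freedom. It is written in plain Python and is not the site's implementation. The function names solve and polyfit are hypothetical, and taking the degrees of freedom as the number of points minus the number of coefficients (degree + 1) is the usual convention, assumed here rather than confirmed by the site.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; too few distinct x values")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def polyfit(xs, ys, degree):
    """Return (coefficients, variance); coefficients are lowest power first."""
    p = degree + 1  # number of fitted coefficients
    n = len(xs)
    if n <= p:
        raise ValueError("need more data points than coefficients")
    # Normal equations (X^T X) beta = X^T y, where X[i][j] = xs[i] ** j.
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(p)] for i in range(p)]
    xty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x ** k for k, b in enumerate(beta)) for x in xs]
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    variance = rss / (n - p)  # RSS / degrees of freedom (assumed n - p)
    return beta, variance


# Made-up data: days since the first poll vs. a candidate's share.
days = [0, 1, 2, 3]
share = [0.0, 1.0, 0.0, 1.0]
coefficients, variance = polyfit(days, share, degree=1)
print(coefficients, variance)
```

Solving the normal equations directly is fine for low degrees and small, rescaled x values; with raw dates or high degrees the system becomes ill-conditioned, and a QR-based solver would be the safer choice.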
Local: This is a window-based kernel smoother that uses a tricube kernel; basically, greater weight is assigned to neighboring data points. The variance is estimated using the normalized residual sum of squares, with the residual degrees of freedom chosen so that the estimate is unbiased; the confidence interval is computed at the 95% level.
Degree: Same as above.
Alpha: This defines the smoothing window as a percentage of the observed data around a given data point. A larger number will include more data, while a smaller number will be more exclusive. A nearest neighbor algorithm is performed to determine the bandwidth at each point.
Bandwidth: This is similar to the Alpha parameter, but is statically defined. The regression uses the greater value of Alpha and Bandwidth at each data point.
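The Local option can be sketched as a tricube-weighted polynomial fit around each evaluation point. The code below is a plain Python illustration under stated assumptions, not the site's implementation: Alpha is treated as a fraction between 0 and 1, the window at a point is taken to be the larger of the distance to the nearest-neighbor cutoff and the static Bandwidth, and the names tricube, solve and local_fit are hypothetical. The variance and confidence estimate are not covered.

```python
import math


def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("window too sparse for this degree; widen it")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def local_fit(xs, ys, x0, degree=1, alpha=0.5, bandwidth=0.0):
    """Estimate the smoothed value at x0 from a weighted local polynomial."""
    n = len(xs)
    # Nearest-neighbor window: distance to the k-th closest point,
    # where k covers a fraction `alpha` of the data (assumed interpretation).
    dists = sorted(abs(x - x0) for x in xs)
    k = min(n, max(1, math.ceil(alpha * n)))
    h = max(dists[k - 1], bandwidth)  # use the greater of the two windows
    if h <= 0:
        raise ValueError("window has zero width")
    w = [tricube((x - x0) / h) for x in xs]
    p = degree + 1
    # Weighted normal equations in the centered variable (x - x0);
    # the intercept of the local polynomial is the fitted value at x0.
    a = [[sum(wi * (x - x0) ** (i + j) for wi, x in zip(w, xs))
          for j in range(p)] for i in range(p)]
    b = [sum(wi * (x - x0) ** i * y for wi, x, y in zip(w, xs, ys))
         for i in range(p)]
    return solve(a, b)[0]


# Made-up, perfectly linear data; a local linear fit should reproduce it.
xs = list(range(10))
ys = [3 + 0.5 * x for x in xs]
print(local_fit(xs, ys, x0=4.5, degree=1, alpha=0.5))
```

When the window contains too few points with non-zero weight for the chosen degree, the sketch raises an error instead of returning a poor fit; widening Alpha or Bandwidth resolves it, much like the parameter adjustments suggested for sparse data.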
Sometimes the specified input parameters, coupled with the regression technique, result in a really bad fit; this is usually very obvious and typically stems from sparse data. The math is working, but it just doesn't provide a very good fit; try modifying the parameters.
The entire site was developed and coded in C# with an SQL database. There are definitely some performance issues, but for the most part everything works well. Everything is generated in real time; if you're looking at a graph it was generated using the most current data. I don't currently cache anything, but I may start pre-rendering the graphs depending on the server load.