For the next few weeks, including today and my last post, I will be sharing new research on New York. It is “new” because I am doing it from scratch, but it is informed by some of the things found when I was research director at NYCA. The new material also has the benefit of a couple of new database files, helpfully supplied by New Yorkers who care about integrity in elections.
Several years ago, I asked the team to find out how many people had voted in the 2020 election by performing a count of the voter history field. Almost immediately, I was told this wasn’t as straightforward as it sounded. The issue was that the voter history was a concatenated field. Concatenation is when separate values are strung together, end to end, into a single field. For instance, if a first name field containing the name “Seth” is combined with a last name field containing the name “Jones”, you get the concatenated Full Name field “Seth Jones”.
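To make the idea concrete, here is a minimal sketch of concatenating and then splitting a voter history string. The semicolon delimiter and the function names are assumptions for illustration only; the actual separator used in the state file may differ.

```python
# Assumption: ";" separates entries in the history field; the real file may differ.
DELIMITER = ";"


def concatenate(entries):
    """Join separate election entries into a single history string."""
    return DELIMITER.join(entries)


def split_history(history):
    """Split a concatenated history string back into individual entries."""
    return [part.strip() for part in history.split(DELIMITER) if part.strip()]


# The name example from the text: two fields joined with a space.
full_name = " ".join(["Seth", "Jones"])

# The same idea applied to election entries.
history = concatenate(["2020 General Election", "YORKVILLE VILLAGE ELECTION 2013"])
entries = split_history(history)
```

Splitting is the step that every count depends on, because a single record can hold many elections in one field.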
This was likely done in NY to reduce administrative overhead. There are over 20 million records in every version of the database I’ve seen, and 23 million in the most recent, dated August of 2025. For each field, and there are 47 of them right now, that’s 23 million pieces of information. If the voter history field wasn’t concatenated, they’d need hundreds of extra fields, one for every election. And each would be multiplied by 23 million. So, they concatenate. It makes sense.
Extracting specific elections from the voterHistory field isn’t a problem if you are looking up one or two records. If you want to see how many people in total voted in a certain election, that can be a problem. Some are easy: “YORKVILLE VILLAGE ELECTION 2013” probably only exists in one form and can be searched for. “2020 General Election” has 16 variants, making it much more difficult to count.
What I had to do over the past couple of days was make a list of every term used to describe an election in NY. There are 6,745 of these. That would mean 6,745 extra columns if not for concatenation, but only if each term described a unique election. They don’t. There are 23 different terms used to describe the 2012 election. Even worse, I had to make a judgment call on some of these.
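Building that list amounts to splitting every history field and tallying the distinct terms. The sketch below shows one way this could be done; it is not the exact procedure I used, the semicolon delimiter is an assumption, and the sample terms are invented for illustration.

```python
from collections import Counter


def collect_terms(histories, delimiter=";"):
    """Tally how often each distinct election term appears across all records."""
    counts = Counter()
    for history in histories:
        for term in history.split(delimiter):
            term = term.strip()
            if term:  # skip empty pieces from blank or trailing-delimiter fields
                counts[term] += 1
    return counts


# Invented sample rows, one history string per voter record.
sample = [
    "2020 General Election;GE 2012",
    "GENERAL ELECTION 2020;GE 2012",
    "",
]
term_counts = collect_terms(sample)
distinct_terms = sorted(term_counts)
```

The number of keys in `term_counts` is the number of distinct terms, which is the figure that reached 6,745 in the real data.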
In 2008, they apparently had 11 General Elections on November 1st, 2008 through November 11th, 2008. At least it looks that way in the records. In addition to all the normal variations, they have an additional 11 entries in this format, ‘20081101GE’ in addition to four in this format, ‘20081101’, for 13 variations in excess of all other elections. The question is, “did they really hold eleven different general elections in 2008?” I don’t know. When I looked it up, I found that in 2008 (Obama’s first election), the election was held on 11/04.
For counting purposes, I decided to use 11/04/2008 as the date, and discarded the rest. The fact I had to make a decision introduces the possibility of error. That is true regardless of what the decision is. If I had kept the other dates, that might have multiplied the count beyond the reality. That wouldn’t have been good either. In my view, including non-official election dates posed more risk than not counting them. The 2020 election had a date like this also. The election occurred on 11/03/2020, but there are a small number of entries dated 11/06/2020. I only counted the 11/03 votes, reasoning that the few 11/06 votes wouldn’t change anything if it was the wrong choice.
After finding every single election description, I had to filter out the ones I wasn’t interested in: primaries, special elections, local and state only elections, non-federal elections, and elections held before the year 2000.
Then there was the problem of vote method. Many entries had variations signified by initials enclosed by parentheses as a suffix. These denoted vote method, but did not serve to distinguish one election from another. Those had to go. In the end, I had the list shown at the top of this post, but it extends back to 2000 rather than stopping in 2016 in the image.
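Removing the vote method suffix can be sketched with a regular expression. This assumes the suffix is always a run of letters in parentheses at the very end of the term; the sample terms and the specific initials are invented for illustration.

```python
import re

# Assumption: the vote method appears as letters in parentheses at the end of a term.
VOTE_METHOD_SUFFIX = re.compile(r"\s*\([A-Za-z]+\)\s*$")


def strip_vote_method(term):
    """Remove a trailing parenthesized vote-method marker, if present."""
    return VOTE_METHOD_SUFFIX.sub("", term).strip()


# Invented sample terms that differ only by vote method.
raw_terms = [
    "2020 General Election (A)",
    "2020 General Election (E)",
    "2020 General Election",
]
cleaned = {strip_vote_method(t) for t in raw_terms}
```

Because the pattern is anchored to the end of the string, parentheses that appear in the middle of a term are left alone.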
With the election term “dictionary” available, I could then perform the counting operation originally performed years ago. The results were a bit better, but still don’t reconcile with the certified totals. Keep in mind this is a raw count of the voter history entries in the voter rolls. It is based on the election dictionary made by combing through the database for every term used to describe these elections.
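The counting step can be sketched as a lookup from each variant term to one canonical election. The dictionary entries below are a tiny invented sample, the delimiter is an assumption, and counting each record at most once per election is a design choice of this sketch, not necessarily how the original count was run.

```python
from collections import Counter

# Invented sample of an election dictionary: variant term -> canonical election.
ELECTION_DICTIONARY = {
    "2020 General Election": "2020-11-03 General",
    "GENERAL ELECTION 2020": "2020-11-03 General",
    "20081104GE": "2008-11-04 General",
}


def count_votes(histories, dictionary, delimiter=";"):
    """Count records per canonical election; terms not in the dictionary are ignored."""
    totals = Counter()
    for history in histories:
        elections = set()  # one record counts once per election
        for term in history.split(delimiter):
            canonical = dictionary.get(term.strip())
            if canonical is not None:
                elections.add(canonical)
        totals.update(elections)
    return totals


histories = [
    "2020 General Election;20081104GE",
    "GENERAL ELECTION 2020",
    "2020 General Election;GENERAL ELECTION 2020",  # two variants, same election
    "20081101GE",  # non-official date, left out of the dictionary on purpose
]
totals = count_votes(histories, ELECTION_DICTIONARY)
```

Leaving a term out of the dictionary is how a judgment call, such as discarding the non-official 2008 dates, shows up in the count.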
There are some things not taken into account. For instance, Schoharie County recently changed its format so that its results for the last couple of years don’t include dates, only election type. Those couldn’t be counted. However, Schoharie has fewer than 35,000 registered voters in the 2022 database, which isn’t enough to account for the count discrepancies, particularly since this only applies to the last two elections. Before that, Schoharie did include date information.
Another issue is clone records. If two records assigned to the same voter have different ID numbers and both show a vote for the same election, only one vote should be counted. Since we know this occurred thousands of times (even with an assistant district attorney), not factoring these in results in an overcount.
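The effect of clones on a count can be illustrated as follows. This sketch assumes clone records have already been linked to a shared `person_key`, which is a hypothetical field; identifying clones in the first place is a separate and harder problem. All sample data is invented.

```python
def raw_count(records, election):
    """Count every record showing a vote in the election, clones included."""
    return sum(1 for r in records if election in r["elections"])


def deduplicated_count(records, election):
    """Count each person once, even if several of their records show the vote."""
    return len({r["person_key"] for r in records if election in r["elections"]})


# Invented sample: A1 and B7 are two ID numbers assigned to the same person.
records = [
    {"voter_id": "A1", "person_key": "p1", "elections": {"2020 General"}},
    {"voter_id": "B7", "person_key": "p1", "elections": {"2020 General"}},
    {"voter_id": "C3", "person_key": "p2", "elections": {"2020 General"}},
    {"voter_id": "D9", "person_key": "p3", "elections": set()},
]
raw = raw_count(records, "2020 General")
unique = deduplicated_count(records, "2020 General")
overcount = raw - unique
```

The difference between the two figures is the overcount attributable to clones for that election.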
With the exception of the 2016 election, where the certified results were about 400,000 votes less than what the voter history tells us, all other elections have certified totals higher than what the voter history tells us. However, this number is inflated by the presence of clone votes, so the actual discrepancies are lower.
Of greater concern to me is that in each successive snapshot of the database, the count of votes increases. Since most of these elections occurred one or more years before the snapshot was made, the vote counts should be frozen. They aren’t. Due to this, there is no question that there will always be a discrepancy with the certified total, because the totals in each snapshot are different.
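Checking whether past counts stay frozen is a matter of comparing the same election across consecutive snapshots. The sketch below uses invented labels and numbers purely to show the comparison; it does not reproduce any real totals.

```python
def count_changes(snapshots):
    """Report elections whose count differs between consecutive snapshots.

    snapshots: list of (label, {election: count}) ordered oldest to newest.
    """
    changes = []
    for (old_label, old), (new_label, new) in zip(snapshots, snapshots[1:]):
        for election in sorted(old.keys() & new.keys()):
            delta = new[election] - old[election]
            if delta != 0:
                changes.append((election, old_label, new_label, delta))
    return changes


# Invented sample counts for three snapshots.
snapshots = [
    ("snapshot 1", {"2016 General": 100, "2020 General": 120}),
    ("snapshot 2", {"2016 General": 103, "2020 General": 120}),
    ("snapshot 3", {"2016 General": 104, "2020 General": 125}),
]
changes = count_changes(snapshots)
```

If the counts were truly frozen after an election, the returned list would be empty for that election.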
This is problematic because voter rolls should be serviceable as a practical way to reconcile the ballot count with the number of people known to have voted. If they don’t match, then either fraud occurred or the counting system is too unreliable to be trusted and must be fixed or replaced.
I have been unable to find a law that requires using the voter rolls this way, but I think there should be one. To work, the rolls must accurately record each vote as it is cast by correctly attributing it to the right voter in his or her voter history. If that were done, this would be among the easiest ways to quickly determine whether an audit of election results should be done, and no election should be certifiable if a ballot/voter history reconciliation check fails.
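A reconciliation check of this kind could be as simple as the sketch below. The function name and the `tolerance` parameter are hypothetical; a tolerance of zero reflects the view that the two numbers should match exactly, and the sample figures are invented.

```python
def reconcile(certified_total, history_count, tolerance=0):
    """Compare the certified ballot total with the voter history count."""
    difference = history_count - certified_total
    return {
        "difference": difference,  # positive: more history entries than ballots
        "passes": abs(difference) <= tolerance,
    }


# Invented figures for illustration only.
exact = reconcile(certified_total=1000, history_count=1000)
short = reconcile(certified_total=1000, history_count=990)
```

A failed check would not say what went wrong, only that the election warrants an audit before certification.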
One of the things that you're highlighting, Andrew, is just how bad the election-related data management systems are in these states. The data structures they are using are just not suited for what they are trying to do with them. They are required to provide a "voter roll," but a voter roll is not a voter history, so they are mashing voter history information into the roll. So, it's basically inevitable that they end up with the problem that you're seeing. As citizens, we should be demanding that they redesign everything from the ground up with a focus on auditability. The system needs to be such that all the citizens can download the data and "know what happened." Ideally, it should be possible to virtually "replay" any election going back decades. Tens of millions of records might sound like a lot to anybody who's not a "computer person," but it's really small potatoes. Storage is cheap, so we should capture EVERYTHING and retain it. Throw nothing away. Create timestamped activity records for everything. Don't merely provide current status, but provide detailed HISTORY: not just the current state of a record (e.g., the current name, current address, and whether it's currently active or not), but when was it created, who created it, when was it modified, what was the modification, who modified it, in which election did it vote and on which day and time, when did it become inactive, and who deactivated it? We should be able to answer all those questions with our databases. The fact that you get merely a 1-table CSV file from the state is atrocious.
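The timestamped activity records described in this comment could look something like the following append-only log. Every field and function name here is hypothetical and the sample events are invented; it is only a sketch of the idea, not a proposal for a specific schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ActivityRecord:
    """One immutable event in a voter record's history (hypothetical schema)."""
    timestamp: datetime
    voter_id: str
    actor: str    # who made the change
    action: str   # e.g. "created", "modified", "voted", "deactivated"
    detail: str


def history_as_of(log, voter_id, moment):
    """Return one voter's events up to a moment, oldest first."""
    events = [e for e in log if e.voter_id == voter_id and e.timestamp <= moment]
    return sorted(events, key=lambda e: e.timestamp)


# Invented sample log; entries are appended and never edited or deleted.
log = [
    ActivityRecord(datetime(2020, 11, 3, 9, 30), "A1", "poll_site_12", "voted", "2020 General"),
    ActivityRecord(datetime(2019, 5, 1, 12, 0), "A1", "clerk_01", "created", "new registration"),
    ActivityRecord(datetime(2022, 1, 15, 8, 0), "A1", "clerk_02", "deactivated", "moved"),
]
actions = [e.action for e in history_as_of(log, "A1", datetime(2021, 1, 1))]
```

Because nothing is overwritten, the state of any record at any past date can be reconstructed by replaying the log up to that date.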
Clearly, this system is sick and should be killed off -- start over -- sequential database.