I recently received two fresh copies of New York’s voter rolls, dated July and August of this year. For the past 36 hours or so, I have been busy processing them. I also wanted to be sure they were complete before I started writing about them here. For those who remember, I had an incomplete copy last week, but had to pull the article I wrote about it when informed it wasn’t a full copy. I don’t want to do that again.

First, the new rolls are complete. Second, they show that much of what needed to be cleaned up two years ago still needs to be cleaned up now, no matter what Tom Fitton might say. Third, the reason I didn’t just post a quick update is that I wanted to make some direct comparisons with the older records. Doing that requires standardizing the files so that they all have the same fields, all errors are corrected, etc.

The act of cleaning up the files prior to analysis is what is taking all my time. This “clean up” is not the same as the cleaning referred to in the previous paragraph. That refers to serious systemic inaccuracies that require correction under the Help America Vote Act (HAVA) to be compliant with federal law. The cleaning up I am doing right now preserves all of those problems, but organizes the existing data in a non-destructive way so that different database snapshots can be analyzed apples to apples. This is required because there are differences in the way some files are organized that have to be straightened out.

For instance, the 2025 voter rolls have a couple of fields that don’t exist in the 2021 files. One of them may have been a reaction to my research. A colleague at NYCA and I had found tens of thousands of records that had no address, making them incomplete. We communicated this to the Board of Elections and some county commissioners, who never officially responded.

In the next version of the voter rolls, there was a new field, “NONSTDADDR”, or “Non-standard address”. In that and subsequent versions of the rolls, all blank address records were now “non-standard address”. This meant that the address in the county records (assuming there was one) couldn’t be entered in the state system. This is frustrating for anyone who wants to canvass addresses, but from the NYSBOE’s perspective, it dealt with the problem by accounting for it even if the records still didn’t have addresses.

Another problem is that some entries use unusual characters, such as from the Cyrillic alphabet, to spell foreign names. These cause an offset in the data, where everything gets shifted to the right. If this happens in the first name field, it can shove all subsequent data over one column so that things like ZipCode are found in the election district (ED) column instead. As you can imagine, this can make it difficult to analyze the data. I’m fixing that now. Let’s check on my progress:
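As an illustration only, here is a minimal Python sketch of one way such shifted rows could be flagged. It is not the script that produced the log lines in this article. The column names `ZIP` and `ED`, the sample rows, and the helper `looks_shifted` are all hypothetical, and the check assumes a ZIP-shaped value in the ED column is a sign of a shift, which would need to be verified against real election district numbering.

```python
import re

# Matches a 5-digit ZIP code, optionally followed by a ZIP+4 suffix.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")


def looks_shifted(row, zip_col="ZIP", ed_col="ED"):
    """Return True when the ED column holds a ZIP-like value
    and the ZIP column does not (hypothetical column names)."""
    zip_val = (row.get(zip_col) or "").strip()
    ed_val = (row.get(ed_col) or "").strip()
    return bool(ZIP_RE.match(ed_val)) and not ZIP_RE.match(zip_val)


# Made-up sample rows: the second one is shifted one column to the right.
rows = [
    {"FIRSTNAME": "ANNA", "ZIP": "12345", "ED": "7"},
    {"FIRSTNAME": "Анна", "ZIP": "ALBANY", "ED": "12207"},
]

# Indexes of rows that need manual or automated realignment.
flagged = [i for i, r in enumerate(rows) if looks_shifted(r)]
print(flagged)
```

Flagging rows rather than repairing them in place keeps the step non-destructive, so the original values remain available for review.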



2025-09-05 21:14:26,220 - INFO - Processed 16,000,000 records...

Only five million or so to go for this file, then it has to be done for the next three.

It is interesting that voter rolls are public, but the rolls are not very useful in the form delivered to anyone who asks for them. This doesn’t mean that most of the records are inaccurate. Most are probably fine. However, the errors are numerous enough to either cast doubt on the rest, or make fully accurate reporting impossible. As it is, I think any findings should have an asterisk beside them and a disclaimer that they are based on “best available information”.

The voter history field poses the most complicated data cleaning problem in the file. Each county records elections differently. For instance, one county will record a vote in the 2020 General Election as “20201103GE” and another will use “General Election 2020”. There are around 11 different ways I’ve seen to record each election. This defeats software that looks for identical entries for identical events, because entries for the same election aren’t identical.
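To make the idea concrete, here is a hedged sketch of mapping different labels for the same election onto one canonical form. It handles only the two formats quoted above; the other variants would each need their own pattern. The function name `normalize_election` and the `(year, "GE")` tuple are assumptions made for this example.

```python
import re

# Patterns for the two formats quoted in the article; nothing else is covered.
COMPACT_RE = re.compile(r"^(\d{4})\d{4}GE$")  # e.g. 20201103GE
VERBOSE_RE = re.compile(r"^General Election (\d{4})$", re.IGNORECASE)


def normalize_election(label):
    """Return (year, "GE") for a recognized general election label,
    or None when the label does not match a known pattern."""
    text = label.strip()
    for pattern in (COMPACT_RE, VERBOSE_RE):
        match = pattern.match(text)
        if match:
            return (int(match.group(1)), "GE")
    return None


a = normalize_election("20201103GE")
b = normalize_election("General Election 2020")
print(a == b)  # both labels now compare as the same election
```

Returning `None` for unrecognized labels, instead of guessing, lets unmatched entries be counted and reviewed separately.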

In recent years, Schoharie County has defined the election year in the column header, then put the type of election in the cells below. When this gets transmitted to the state, which concatenates (combines) the voter history data, the years get stripped. Therefore, general elections in 2024, 2023, and 2022 are recorded as “General Election, General Election, General Election”. It’s possible to figure out the year if the voter voted in every election after the format change was made. If not, then there is no way to know which year a single vote was cast in. For that reason, these Schoharie voter history records are useless for checking on duplicate votes.

This is what a “duplicate vote” looks like in the data:

“General Election 2020 (A), General Election 2020 (P)”

What this line tells us is that there was an absentee vote cast (A), and then someone with the same ID number also voted in person (P). I first noticed this in 2022. For the 2020 election, there were over 225,000 of these entries. My theory is that the county received the absentee ballot, recorded it, and sent it to the state. Then, the in-person vote was cast, which overwrote the absentee ballot noted in the county’s voter history field, and was sent to the state. At the state level, voter history is additive, so it allows long strings of data to be added to the same field over the years. This is done to reduce the number of columns needed to represent all the elections. It also causes the prior absentee vote to be retained, thus creating the “double vote” phenomenon.
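A pattern like this can be searched for mechanically. The sketch below is one possible approach, not the method behind the 225,000 figure. It assumes that entries are separated by commas, that each entry ends with a one-letter method code in parentheses, and that election labels have already been standardized and contain no commas. The name `double_vote_elections` is hypothetical.

```python
import re
from collections import defaultdict

# An election label followed by a one-letter method code, e.g. "(A)" or "(P)".
ENTRY_RE = re.compile(r"^(?P<election>.+?)\s*\((?P<method>[A-Z])\)$")


def double_vote_elections(history):
    """Return the elections that appear in a voter history string
    with both an absentee (A) and an in-person (P) entry."""
    methods = defaultdict(set)
    for raw in history.split(","):
        match = ENTRY_RE.match(raw.strip())
        if match:
            methods[match.group("election")].add(match.group("method"))
    return sorted(e for e, seen in methods.items() if {"A", "P"} <= seen)


sample = "General Election 2020 (A), General Election 2020 (P)"
hits = double_vote_elections(sample)
print(hits)
```

Entries without a year, such as the Schoharie records, cannot be grouped reliably this way, because separate elections collapse into one label.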

2025-09-05 21:29:24,592 - INFO - Processed 20,500,000 records...

For me, the issue with these double votes isn’t whether they represent actual fraudulent double voting or not. The issue is that these records violate HAVA whether or not they represent fraud. Here are the possibilities:

The entries are true. If so, then they do represent fraudulent voting, which violates HAVA and other laws.

The entries are false. If so, then they violate HAVA because the records aren’t accurate. Even worse, they make innocent citizens appear to be guilty of prosecutable election fraud.

When I originally looked into this, I only looked at the 2020 election. I knew other elections were impacted, because I could see double voting in other years. However, I didn’t have the time to follow through and count them. Now, I want to look at other years as well, to get a better understanding of the scope of the problem.

I wanted to write about this yesterday, but ran into some of these data cleaning problems. My adjusted timeline optimistically anticipates results by tomorrow.

If you wonder about the relevance of the illustration for this article, here it is:

In this cover, a woman reports a suicide to her insurance company. The image gives us all the information we need to determine it was murder, not suicide. From the insurance company’s perspective, it is suicide unless they investigate and discover differently. Meanwhile, the crooks are covering their tracks by wiping fingerprints from the gun.

Elections can be like that too.

2025-09-05 21:43:31,170 - ERROR - Error during processing: Unable to allocate 6.81 GiB for an array with shape (20765243, 44) and data type object