Obfuscation of PII in Released Databases via Randomization

For the past year, my school, Twin Rivers Adult School, has been using an Excel spreadsheet I created as the main database for administering Pell Grants.  While Excel was a good tool for getting our system up and running right away, it is not a good long-term solution, and a proper database needs to be implemented.  As I discussed in other posts, the commercial solutions are expensive and do not really meet our needs as a clock-hour-based school.  Thus, this summer, Steve Jensen, our Office Technician instructor, and I are running a special course in which our advanced students will work with us to develop a new database.

I believe this will be a good “win-win”.  The students will get real experience on a real development project and gain some specialized knowledge of Federal Student Aid, which in my opinion is an untapped vertical market.  Further, they will be able to earn lower-division credits from our school, and the adult school will be able to partially reduce its development cost.  (These types of projects are never “free labor” from the students, though.  Steve and I will need to invest extra time helping the students, and while we hope this will be less than the time we would need to do the development ourselves, we are not assured of this.)

One of the critical components for us as a school, however, is security.  While we will have our students take a pledge and sign an agreement not to share any personally identifiable information (PII) they might incidentally come into contact with, I still did not want the original data with PII to be dispersed via copying, etc.  So I created a spreadsheet that randomly assigns fake PII, such as fake names and emails, in place of the real information.  This way I can distribute the Pell Grant spreadsheet to the students with its real records intact, so they can work with real data and see all the real scenarios the database will need to handle, but without receiving any real personal information about students.
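
For readers who would rather script this step than use a spreadsheet, here is a minimal Python sketch of the same idea.  It is not my actual spreadsheet logic: the field names (first_name, last_name, email), the tiny name pools, and the obfuscate function are all made up for illustration, and a real run would need much longer name lists.

```python
import random

# Hypothetical pools of fake names; a real list would be much longer.
FAKE_FIRST = ["Alex", "Jordan", "Taylor", "Morgan", "Casey"]
FAKE_LAST = ["Rivera", "Nguyen", "Smith", "Patel", "Lopez"]


def obfuscate(records, seed=None):
    """Return copies of the records with name and email replaced by fakes.

    Non-PII fields (for example clock hours) are copied through unchanged,
    and the original records are not modified.
    """
    rng = random.Random(seed)
    cleaned = []
    for i, rec in enumerate(records):
        first = rng.choice(FAKE_FIRST)
        last = rng.choice(FAKE_LAST)
        fake = dict(rec)  # shallow copy so the source row stays intact
        fake["first_name"] = first
        fake["last_name"] = last
        # The row index keeps fake emails unique even if names repeat.
        fake["email"] = f"{first}.{last}{i}@example.com".lower()
        cleaned.append(fake)
    return cleaned
```

Calling obfuscate(rows, seed=1) on a list of dictionaries returns new rows with the same non-PII values, so the students still see every real scenario in the data.  Any other identifying columns (ID numbers, birth dates, addresses) would need the same treatment.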

I have placed this name-randomizing spreadsheet online so that other database administrators and researchers who wish to obfuscate PII can use it.  Currently the spreadsheet does a good job with first and last names.  I hope to improve the middle name algorithm in the future, and also to define the probability distribution function of the weighted method I use to determine first names and last names.
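
To give a rough idea of what a weighted method and its probability distribution can look like, here is a small Python sketch.  It is an assumption for illustration, not the spreadsheet's actual algorithm: the names, the weights, and the helper functions pick_weighted and probabilities are all invented.

```python
import random

# Hypothetical (name, weight) pairs; the weights are made-up relative
# frequencies, not real census figures.
WEIGHTED_LAST_NAMES = [
    ("Smith", 50),
    ("Garcia", 30),
    ("Nguyen", 15),
    ("Zielinski", 5),
]


def pick_weighted(pairs, rng):
    """Draw one name, with probability proportional to its weight."""
    names = [name for name, _ in pairs]
    weights = [weight for _, weight in pairs]
    return rng.choices(names, weights=weights, k=1)[0]


def probabilities(pairs):
    """Return the discrete distribution implied by the weights."""
    total = sum(weight for _, weight in pairs)
    return {name: weight / total for name, weight in pairs}
```

Under these assumed weights, the chance of drawing a given name is simply its weight divided by the sum of all weights, so common names show up more often in the fake data, roughly as they would in a real roster.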
