Information and Communication Technology

A Response from ScraperWiki

I was very happy to receive a response today from ScraperWiki to the critique of their new site that I posted yesterday.  I always appreciate when feedback I take the time to give is actually considered, and here is what Zarino Zappia had to say:

Hi Jacob,

All valid and very understandable points. Your blog post has caused much debate at ScraperWiki HQ 🙂
I’m not sure I can fit a proper explanation of our reasoning into a single email, so all I’ll say is, thank you (really, thank you) for giving us your honest feedback, and keep an eye on for an upcoming series of posts which’ll hopefully shed more light on pretty much all of the points you’ve raised.
Zarino Zappia
VP of Product, ScraperWiki


The Email for FCMAT and CSIS is still down

Last week I posted about how, when I tried to reply to an email from someone at FCMAT / CSIS, I got an error back: “The email address you entered couldn’t be found.”  Well, I’m still getting this error, and of course, I can’t seem to email them about it! 🙂

It is important to recognize that this is the organization that is supposed to help school districts and others improve their practices, and that also manages the database of all the student information in California.  If they can’t get their email to work after more than a week, it gives me some concern about the private data of nearly every child in our state.

I am going to try to call them today and see if they can graciously acknowledge their issues and fix them, as that is the true sign of whether we should be concerned or not.  An organization that only tries to cover up and blame away its problems is one that obviously won’t fix them, and that is quite dangerous for the public when that organization is the keeper of your private information.  At this point I’m not making that accusation, but from my dealings with public agencies (and quasi-public agencies), my experience suggests this could become an accusation that needs to be made, depending upon future evidence.

My Thoughts on ScraperWiki or Don’t You Just Hate when New “improved” Software is Worse!

A little over a month ago, I started to use ScraperWiki, a coding site dedicated to helping people get data from the web and share both the data and the code.   It was almost a SourceForge for data, and it had many of the things I had wanted in the Data of Thought project that I’ve been slowly working on.

BUT, as with many websites, they changed, and unfortunately not in a good way.  First, they are now charging higher fees and giving far less with their free account.   Of all the changes, I understand this one the most: you can’t hire good staff, especially data scientists, if you don’t have a business model that makes money.   But charging over $100 a year for their basic account, which only gives 640 MB of space, is ridiculous.  A 10-cent CD-RW has that much space. (Granted, you would never do any of this on a CD-RW, but it shows how out of whack their pricing is.)   There are lots of web hosts that charge far less and give far more for a coder.

But maybe the pricing would be worthwhile for a good scraping interface and a community to support the scraping.   BUT, again, these are now worse.  While they say they have the same “Code in your Browser” feature, they changed it so that it no longer shows the panel for the console, data, and sources.  This was one of the best parts of their interface, as it really made debugging a scraper easier.   While I like the new way they show code in the browser, it is not worth losing an interface that specifically helps you code a scraper.  If I wanted a good standard IDE (Integrated Development Environment), I could just load one on my computer instead of using them.

And what about the community?  It seems to be gone.  I don’t see how I can share my scrapers with others, look at their code, etc.    This negates any sense of them being a “wiki,” as their name implies. I know they now have more integration with Git, and this might be able to fill the community role, but they don’t make this very clear, nor do they have a seamless interface that shows how this is the new community feature.

While I know they are going more after the market of journalists, and trying to make data analysis easier with their new “tools”, they didn’t have to make changes (other than possibly pricing) that take away from what they had before.  I now need to weigh my options and see whether the new features at the new pricing are worthwhile compared to how much has been taken away.  I suspect that I won’t stay with ScraperWiki, but I want to at least give them a chance to address these issues before making my decision.

Integrating Python into Excel for Web Scraping, Advanced Functions, and More

For some time I’ve wanted to learn Python.  It seems to be one of the best rapid-development open source languages for all types of scripting and for integration between different languages.  It is also very commonly used for scraping and advanced data processing.  And I have realized from my past that I want to invest my time in learning things that have long-term usability, and proprietary software, in general, does not necessarily have this benefit.

But I still want to use Excel.  While there are open source data analysis tools out there, they either don’t have as many features as Excel (as, unfortunately, OpenOffice and LibreOffice don’t), or I haven’t learned them (as with some of the open source math packages and languages).  The feature gap is especially clear in data presentation: Excel can make nice-looking charts, diagrams, etc., while most of the others take much more work to do these things.  (Maybe some day I will be able to help with the open source projects and make them better, but for right now, I don’t have the time.)

So my next incremental step toward using more open source software is to start doing any advanced programming for Excel spreadsheets in Python instead of VBA.  This is especially relevant to me right now, as I have many spreadsheets that I need to do web scraping for, and while I have learned to do some of this in VBA, I want to invest my energies in learning Python, which also has a lot more pre-written code for this.
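As a taste of why Python appeals to me for scraping, here is a minimal sketch of pulling a table out of HTML using only the standard library.  (The TableScraper class and the sample HTML are made up for illustration; a real scraper would first fetch the page, e.g. with urllib.)

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collects the text of every <td> cell into rows, one list per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows
        self._row = None      # row currently being built
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# Hypothetical sample input; in practice this would come from a web page.
html = ("<table><tr><td>School</td><td>Enrollment</td></tr>"
        "<tr><td>Twin Rivers</td><td>1200</td></tr></table>")
scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)  # [['School', 'Enrollment'], ['Twin Rivers', '1200']]
```

The parsed rows can then be written into a spreadsheet, which is exactly the kind of glue work I want Python for.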

I also want to be able to create Excel files that can be easily distributed.  And this is where the rub comes.  Several methods have been developed to integrate Python and Excel, including using a DLL file, two add-ins, and the traditional method of using a Python COM server.  Generally all of these require external files, and sometimes external installations.  But I think I can solve this problem by embedding Python scripts as OLEObjects and calling them from Excel using VBA.   These can then set up the COM server and do everything else necessary.

While I’m sure I will run into some walls along the way, I think the idea is sound, and as I develop this more, I will share the VBA code that can be used to do this technique.

Excel VBA Macro Code to Find Word-Level N-Grams in a Text Entry

Excel has a lot of quantitative power, but it has always had issues with parsing text.  While there are the basic text functions (LEFT, MID, RIGHT) and a useful SUBSTITUTE function, sometimes something that seems like it should be easy isn’t.  As a case in point, extracting individual words from text can be done with the Text to Columns feature using the space as a delimiter, but it is not easy to do with a formula: you basically need to find where a certain space is within the text, which means counting where the previous spaces are, which requires a loop, which in turn requires VBA.  So I did a little VBA coding tonight and tried to make a function that can find not only single words but also phrases, as n-grams.  I have since used this for some qualitative analysis of text; by combining these functions with Pivot Tables, I have been able to make Excel a decent basic qualitative analysis tool for text.

Here is the VBA file for the code, since when I tried to post the code inline with this message, my WordPress host thought I was trying to do some sort of injection attack or something.  Also you can find an example of how I used the code in the analysis I recently did of Literature and Medicine college courses.

Update: I have improved the efficiency of the code by using VBA’s split function instead of using a lot of extra loops.
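For readers moving toward Python, the same word-level n-gram idea can be sketched there with the built-in split, analogous to the VBA Split approach mentioned in the update.  (This is a generic sketch, not the code from the linked VBA file.)

```python
def ngrams(text, n):
    """Return all word-level n-grams in text as space-joined strings."""
    words = text.split()  # split on whitespace, like VBA's Split on spaces
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("literature and medicine in college", 2))
# ['literature and', 'and medicine', 'medicine in', 'in college']
```

Feeding the output into a frequency count gives the same kind of table I build with Pivot Tables in Excel.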



Android Setup for a Samsung Admire (SCH-R720) with MetroPCS

This weekend I reloaded my Samsung Admire from scratch because its internal storage ran out of room.   I had also learned a lot over the past few months about which apps were the best.   With this new setup, my cheap Samsung Admire, which was practically given away to me when I signed up, and which I’m sure was engineered to nudge people toward buying better phones, has become a decent phone for my needs.   The following requires some computer expertise, and I would not go down this path without a good technical background, but for someone who is willing to do a little work and get through the trial-and-error process, here is how to have a semi-decent phone:

1. Root the Phone and Load Superuser

If you want to do anything even semi-advanced on a phone, you need to root it and get the Superuser app.  Instructions for rooting the Samsung Admire (SCH-R720) are available here.   One thing I found in this process is that the batch file didn’t work right for me at first.  So I looked at all the commands in it and just ran them one by one from the command prompt, and it worked just fine, even though there were some error messages along the way.   After rooting the phone, make sure you go to Google Play and download the Superuser applet.

2. Setup the SD Card to be used for Apps

One of the major problems with the Admire is that it has a very small amount of internal memory for loading apps.  Originally I had tried DroidSail Apps2SD to fix this problem, but this app and similar ones don’t really move all the parts of an app to the SD Card.  The real solution is to use the app Link2SD.  There is a guide here that, while written for a different phone, is basically what you need to do to get this to work with the Admire.  As a point of caution, it took me several tries to get this to work, and I had to make both partitions “primary” with FAT32 file systems. (Although other partitioning might work also.)   I also used the BOOTICE program suggested in that article.   When you do all of this, you really should back up your SD Card ahead of time, as you will probably lose all of its contents while you are mucking around.  Somewhere in the process, my laptop stopped recognizing the phone as USB storage, and at first I thought my phone was at fault, but it turned out to be something about my laptop’s driver, as the phone still works on other computers.

But even though this whole process was time consuming and frustrating, I am VERY happy with the results.   All of my downloads automatically link to the SD Card, and now, after reloading all my apps from before, instead of having only 40 MB of internal storage free, I have about 150 MB free.   Before, with so little internal storage available, my system was unstable, and that is why I needed to reload.   I recommend that any tech-savvy new Admire customers set up Link2SD when they first get their phone and save themselves a lot of future headache.

3. Get Titanium Backup

This free app does so much and works so well.  I previously had Root Uninstaller Pro, and it didn’t work nearly as well or do as much as Titanium Backup.   I used Titanium Backup before I set my phone back to factory defaults, and it did a nearly perfect job of recovering my apps and data afterwards.  The only thing I didn’t get back was my messaging history, but everything else worked.

4. Other Apps I really like

The first three things are what I consider required if someone wants to have a Samsung Admire and actually have it be useful for anything beyond the very basics.   But there are other apps I have downloaded that I like (most requiring root):

  • WiFi Tether – If you want to use your laptop’s web browser with this and MetroPCS, you must have a way to make it look like a mobile browser, as MetroPCS blocks PC browsers.  I personally use Firefox with the User Agent Switcher add-on to do this.
  • Open Manager – An Open Source, non-ad based file manager, that requires root access
  • Advanced Maps Live Wallpaper – Always have a map of where you are in your background.  This is especially useful, since for some reason the built-in Google Maps app doesn’t seem to do this on the Admire, when it has this feature on most other phones.
  • Ralf’s Flashlight – There are WAY too many flashlight apps out there.  I just wanted one that didn’t only use the camera flash, because the Admire doesn’t have a flash, and didn’t have ads that stalked me.   This little app seems to work pretty well, and is very simple.


Best Currently Known Practice for Integrating Access Databases for Analysis with MS SQL

Currently at my school, we need to pull data from software that uses a Microsoft Access back-end.  One of the challenges is that we generally cannot tinker with the MS Access database directly, because doing so voids any support we might want and usually stops the product from working.   So I’ve wanted to have the information go into our MS SQL server, so I can access it with our Meta-Database (I talked a little about this idea in my previous post).

After working on this problem for 20+ hours and going down a lot of wrong paths, I found a solution that works.  To possibly help others in the same situation, I am writing this blog entry not only to hopefully get you quickly to a solution, but also to explain why the other paths didn’t work for me, so that you can avoid them, or, if you decide to tread down them, at least have some forewarning… 🙂

Final Solution: Linked Servers with Microsoft OLE DB Provider for ODBC and OpenQuery

After going down a lot of bad roads, I found that by creating an ODBC connection to my Access database and then using the Microsoft OLE DB Provider for ODBC to attach to that ODBC driver, I was able to get to my data.  For some reason I still could not directly query the data, but I could use the OPENQUERY command to extract it, and then use this to make the mirror views in my Meta-Database.

Step 1 – Setup ODBC

I made a System DSN that pointed to my Access file.  Since my Access file was on another server, I had a hard time typing in the UNC path to get to it, but I was able to get around this by pointing to it locally at first and then doing a registry hack to point to the UNC path later.   I also found that choosing “read-only” in the ODBC driver screwed things up (I forget the exact error message off-hand, but it took a bit of searching forums to find that this caused problems).

Step 2 – Setup the Linked Server

I used the following code to create my linked server:

-- Create the linked server through MSDASQL (the OLE DB Provider for ODBC)
EXEC sp_addlinkedserver 
    @server = N'Name_to_use_as_Linked_Server',  -- the name you will use in queries
    @provider = N'MSDASQL', 
    @srvproduct = N'Access',
    @datasrc = N'ODBC_Name',                    -- the System DSN from Step 1
    @catalog = N'Path_to_MDB'
-- Map logins; for a database-password-protected MDB, only @rmtpassword is needed
EXEC sp_addlinkedsrvlogin 
    @rmtsrvname = N'Name_to_use_as_Linked_Server',
    @useself = N'FALSE',
    @locallogin = NULL, 
    @rmtuser = NULL, 
    @rmtpassword = N'Database_Password' 

In this code, Name_to_use_as_Linked_Server should be whatever name you want to use to reference the database later in MS SQL.   For simplicity’s sake, I generally give my ODBC connection and linked server the same name as the file name of the MDB.  So if I were connecting to sample.mdb, I would call all of them sample.

The @srvproduct apparently doesn’t really matter, but the ODBC_Name does, although again, if you just use the same name as the MDB then this is easy.   An interesting thing to note is that the Path_to_MDB does not need to be what is defined in the ODBC driver.   So in my case I actually had multiple linked servers to multiple MDBs, all using the same ODBC connection (and so I actually broke my rule about the ODBC driver having the same name as the linked server).

I was also working with a password-protected MDB, which sent me down a lot more wrong paths as I figured that part out.  But by leaving @rmtuser NULL and just providing @rmtpassword, it worked fine.  I suspect that if there were user accounts and passwords on the MS Access files, then @rmtuser would need to be one of them, and you might run into issues with the System.mdw.

Step 3 – Create the Mirror Views using OpenQuery

For some reason, I could never get my queries to work directly against the MS Access linked server.  I got a variety of errors, all ambiguous, for which I couldn’t track down a solution from any web searches.  But I found that if I used the OPENQUERY command to run the query through the MDB instead of running it through MS SQL, it actually would return what I was looking for, and by putting this into a view, I could then treat it as a normal MS SQL table (for the most part) in the future.

Here was the syntax I used to make the Mirror Views:

USE Database_Name
GO
CREATE VIEW View_Name AS
SELECT * FROM OPENQUERY(Name_of_Linked_Server, 'SELECT * FROM Table_Name')

Here, Database_Name is the name of the Meta-Database, and View_Name is the name of the mirror view.  I use a naming convention of LinkedServer_Table; thus, if I had a linked server named EmployeeDB and a table named Employees, I would name the mirror view EmployeeDB_Employees.

So, following these steps, the process worked, and I’m happy, because it does not seem to conflict with the third-party software that uses the MS Access back-end.   And all the updates are real-time, so when the third-party software changes something, the mirror views instantly see those changes.  Next, let’s look at all the other paths I explored before coming to this solution.

Read the full article to find out about methods that were considered but not used

Best Known Practice for Integrating Medium Sized Databases for Analysis with MS SQL

I am always hesitant to use the term “Best Practice,” and even “Best Known Practice” is a little wrong.  But this is the best practice I know of, given the situation I’m in.

While I’m aware that data-warehousing techniques might be better, the following is a fairly easy way to integrate many databases together, at least for the purposes of extracting and analyzing data.  The key is to make a “Meta-Database” that, for the most part, only contains views.

Most of the views are what I’m calling “Mirror Views” (if a term for this has already been coined, I hope someone tells me, so I can use that word instead).  A mirror view is simply a view that does a SELECT * from an equivalent table in another database.   By using these views, I can write additional views against them without having to write in the original database, and thus I do not void any support options from the original database vendor.

In SQL Server, I use either sp_tables (for databases hosted on the server) or sp_tables_ex (for linked servers) to retrieve the names of all the tables I want to make mirror views from.   I am pretty sure I could use some set of T-SQL commands to convert these into views, but my T-SQL isn’t that strong, so I copied the output into a CSV, then into a spreadsheet, and actually wrote formulas to generate views of the form DBName_TableName in my Meta-Database.  (I would greatly appreciate it if any readers could share some code to do this in T-SQL, as I know the spreadsheet is probably not a best practice, but it gets the job done!)
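As a stopgap until someone shares a pure T-SQL version, the spreadsheet trick can also be sketched in Python: just generate the CREATE VIEW statements from a list of table names.  (The mirror_view_ddl helper and the sample linked server and table names are hypothetical, for illustration only.)

```python
def mirror_view_ddl(linked_server, tables):
    """Generate CREATE VIEW statements for mirror views named LinkedServer_Table."""
    ddl = []
    for table in tables:
        ddl.append(
            f"CREATE VIEW {linked_server}_{table} AS\n"
            f"SELECT * FROM OPENQUERY({linked_server}, 'SELECT * FROM {table}')"
        )
    return ddl

# Table names would come from sp_tables / sp_tables_ex output.
for statement in mirror_view_ddl("EmployeeDB", ["Employees", "Departments"]):
    print(statement)
```

The printed statements can then be pasted into SQL Server Management Studio and run against the Meta-Database.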

This technique works easily for databases hosted on the SQL server, but for databases hosted on other servers, or on different platforms (such as MS Access, MySQL, etc.), using a linked server is the way to go.  BUT, be forewarned, linked servers are not always as easy to set up as one would hope.   I will be writing more about this in a future post, sharing what I have learned through the school of hard knocks.

I am trying to move toward using Free and Open Source software more often, so if anyone knows how to do something like linked servers in MySQL or PostgreSQL, please let me know, as I’d like to use that solution in the future.

Obfuscation of PII in Released Databases via Randomization

For the past year, my school, Twin Rivers Adult School, has been using an Excel spreadsheet I created as the main database for administering Pell Grants.  While Excel was a good tool to get our system up and running right away, it is not a good long-term solution, and a database solution needs to be implemented.  As I discussed in other posts, the commercial solutions are expensive and do not really meet our needs as a clock-hour based school.  Thus, this summer, Steve Jensen, our Office Technician instructor, and I are running a special course for our advanced students to work with us to develop a new database.

This, I believe, will be a good “win-win.”  The students will get real experience working on a real development project and gain some specialized knowledge in Federal Student Aid, which in my opinion is an untapped vertical market.  Further, they will be able to earn lower-division credits from our school, and the adult school will be able to partially reduce its development cost.  (Although these types of projects are never “free labor” on the part of the students.  There will be an extra investment of time by me and Steve to help the students, and while we hope this time will be less than the equivalent time we would need to do the development ourselves, we are not assured of this.)

But one of the critical components for us as a school is security.  While we will have our students take a pledge and sign an agreement not to share any personally identifiable information (PII) they might incidentally come into contact with, I still did not want the original data with PII to get dispersed via copying, etc.   So I created a spreadsheet that helped me randomly assign fake PII in place of the real information, such as fake names, emails, etc.   This way I can distribute the real Pell Grant spreadsheet to the students, so they can work with real data and see all the real scenarios the database will need to handle, but at the same time they don’t receive any real personal information about students.

I have placed this name-randomizing spreadsheet online for other database administrators and researchers who wish to obfuscate PII.   Currently the spreadsheet does a good job with first and last names; I hope to improve the middle name algorithm in the future, and also to define the probability distribution function of the weighted method I use for determining first and last names.
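To illustrate the general idea (not the actual spreadsheet formulas), here is a minimal Python sketch of weighted random name assignment.  The name pools, weights, and the fake_identity helper are all hypothetical placeholders; a real version would draw the weights from name-frequency data such as census tables.

```python
import random

# Hypothetical name pools and relative frequencies, for illustration only.
FIRST_NAMES = ["Maria", "James", "Wei", "Aisha"]
FIRST_WEIGHTS = [40, 30, 20, 10]
LAST_NAMES = ["Garcia", "Smith", "Nguyen", "Johnson"]

def fake_identity(rng=random):
    """Return a fake name and email to substitute for a real student's PII."""
    first = rng.choices(FIRST_NAMES, weights=FIRST_WEIGHTS, k=1)[0]
    last = rng.choice(LAST_NAMES)
    email = f"{first.lower()}.{last.lower()}@example.com"
    return {"first": first, "last": last, "email": email}

random.seed(1)  # seed only so repeated runs are reproducible
print(fake_identity())
```

The key point is that the substitution is random rather than a reversible transformation of the original name, so the fake records cannot be mapped back to real students.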