The Inherent Bias in LinkedIn Featured Skills & Endorsements

A network graph of LinkedIn logosFor anyone who seriously uses LinkedIn, you know that your profile is really important.  Our profile is our modern day resume, and it is how we will be judged by potential clients and employers.  One part that is especially important are our listed skills.   I have been fortunate that many people have endorsed me for many skills without me ever needing to resort to any form of “hacking“.

But I have come to realize that there is an inherent bias in the LinkedIn Featured Skills & Endorsements, and it is one that cannot be easily fixed, because the bias has to do with who knows what you do, and feels competent enough to make a judgement about it.

Read More →

The Four Major Activities of Data Science / Machine Learning

Recently there was a post on LinkedIn by Erle Hall, lead for the Information and Communication Technologies (ICT) for the California Department of Education (CDE) with a diagram about machine learning.  That diagram had 6 steps: Select Data, Model Data, Validate Model, Test Model, Use the Model, and Tune Model.    Those 6 steps mostly encapsulate what traditionally has been called the “data mining” phase.  But there are 3 other important phases, which I will call “data surfing”, “data wrangling” and “data artistry”.  (These names were chosen to be easier to understand and more interesting for students, but also go by different names)  I also personally prefer to use the term “algorithm” instead of “model”, because while traditionally in data science, statistical models were used, there are now often times methods like neural networks and other such algorithms that are less like a traditional statistical model.  In the next few posts, I’ll dive into each of these 4 steps, and give a basic explanation of what each step does, and why the step is important.

Data Surfing: The Oft Forgotten First Stage of Discovery

You got to drift in the breeze before you set your sails. It’s an occupation where the wind prevails. Before you set your sails drift in the breeze.” – Paul Simon

Many texts about data science (including machine learning, data mining, and predictive analytics) don’t include much about the very first step of the process, which is the step where you come up with what your goal is for your other steps.  In traditional science, this might be called the step of making your hypothesis.

Read More →

Data Wrangling: Gathering the Data You Need in a Form You Can Use

Data! Data! Data!’ I can’t make bricks without clay.” – Sherlock Holmes

Before data science/machine learning/data mining/predictive analytics can be done, you need to have the data you are going to use.  This may see obvious, but in many cases there is more to this step than may first be assumed, and the whole process is what I will call “data wrangling”, although has other names like “data munging”.

Read More →

Data Mining: Discovering Gold in your Data

There’s gold in dem dere data!” – Adaptation of the original quote from M. F. Stephenson

After the data has been gathered and in a form that can be used, it can then have an appropriate algorithm used to accomplish the data mining/machine learning/predictive analytics. This is the stage that traditionally has been called “data mining” because it is the part that gets additional value from the data in the form of some type of knowledge (this is why early on, the process was sometimes called “knowledge discovery in data” (KDD).

Read More →

Data Artistry: Using and Sharing the Knowledge in an Effective Manner

Can you picture that?” – Dr. Teeth and The Electric Mayhem

The final stage of doing data science/machine learning/data mining/predictive analytics is to use the results, which generally involves some form of communication to one or more types of audiences.  This, I will term “data artistry”. (This is not necessarily a common term used, but it does have some precedence in specific contexts)

Read More →

Big Excel Spreadsheet Tip: Don’t Use Smart Tags to Convert Text to Numbers

Don't do this for big spreadsheets!
Don’t do this!

I often work with really big Excel spreadsheets (>500,000 rows) to do various types of data analysis.  And often I’m grabbing data from databases (how else are you going to get such big spreadsheets???)  And a common thing that can happen, is that numbers will be copied in as being in a text format, and many math and lookup formulas won’t work on them.  One tempting way to convert these is to use a “Smart Tag” where, when you select all the numbers that are being treated as text, you get an ! in a diamond, that you can click on, and it gives you the “nice” option to just “Convert to Number”.   If your spreadsheet is over 10,000 rows, DON’T DO THIS.  It will take forever to complete…  (ok, not really forever, but possibly several days to complete.) Instead you should do the following:

Read More →

Why any Emergency Alert System that has a Smartphone App Panic Button that only uses the Internet is a Death waiting to Happen

Our school has been researching various emergency alert systems so that if an emergency happens, we have a manner that staff can quickly and reliably communicate to our administration that there is a problem, and that we can then quickly send out messages as appropriate to our students.

Most emergency alert systems have a smartphone app with some form of panic button.  This is great, except for the fact that these systems all only use the phone’s internet connection, and if that internet connection is not working, then they fail.

Read More →

Fixing Formula Auto-Fill / Replication / Calculated Column Problem in an Excel 2016 Query Table

One of the great new features of Excel 2016, is that it has Power Query and Power Pivot built-in.  This makes connecting to a database a LOT easier, where a database table, view, or query will get imported into an Excel table, and then new columns can be added to that table with formulas.  For folks like myself who are good with Excel formulas, this is often easier (and sometimes more powerful) than trying to have the SQL query itself do calculations.

But, I had run into a problem, which was that on one of my spreadsheets, every time I would refresh a query, all the new data from the query would not have the formula column automatically fill in like I was expecting.   And since I had a fairly complex spreadsheet, if I forgot about this problem, I would get bad data and reports….   So I needed to fix it.

Read More →

Using Match and Index on a Pivot Table: Sometimes Excel has better Performance with “Old School” Formulas

I am a huge fan of pivot tables, and often use them to help transform raw data into a different table, by “mirroring” data from a pivot table, where I have a table to the left of the pivot table, and I want to have the table show the same value the pivot table has, or possibly transform it with a formula.  Often you can just choose the cell reference that the pivot table has and you are done…  But sometimes the pivot table may have multiple columns, that vary, and you wish to do some sort of lookup on the pivot table…  Traditionally, to do this, it would require creating a GETPIVOTDATA formula.  These have always been a bit complex, and now with PowerPivot being built-in to Excel 2016, they are even MORE complex.  This is why some people choose to use cube functions instead, which can now be done on PowerPivot tables….  But I have tested both of these, and they are very slow when you have a lot of rows.   So, I have gone back to an old school method of using the MATCH function (with absolute references) to determine which column a particular heading is in, and then use the INDEX function to select the value from that heading.  This old school trick turns out to be a LOT more efficient than using GETPIVOTDATA, and thus I thought I should share this with any other Excel nerds who read my blog.