If you are a programmer and haven’t heard of the Q&A site StackOverflow, remedy that immediately.
Inspired by a post there, I wrote a simple web-crawler that looked at the age, membership length and reputation of the users of StackOverflow.
I describe the method below, for people who like hearing about software. I graph the results below for people who like looking at colourful graphs. At the bottom, I have a quick conclusion for people who just want the summary.
I wrote a simple Python script that fetched each user’s profile page and snipped out the user’s age (if present), their reputation and the number of days they had been a member of StackOverflow. (Code and data available upon request.)
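For the curious, here is a minimal sketch of the "snipping" step. It is not the original script. The markup it looks for (`class="age"`, `class="reputation"`, and a "member for N days" phrase) is hypothetical, as are the names `parse_profile` and `_first_int`; the real profile pages would need their own patterns.

```python
import re
from typing import Dict, Optional, Pattern

# Hypothetical patterns: the real profile markup is an assumption here.
AGE_RE = re.compile(r'class="age">\s*(\d+)')
REPUTATION_RE = re.compile(r'class="reputation">\s*(\d[\d,]*)')
MEMBER_DAYS_RE = re.compile(r'member for\s+(\d+)\s+days?', re.IGNORECASE)


def _first_int(pattern: Pattern[str], html: str) -> Optional[int]:
    """Return the first captured number as an int, or None if absent."""
    match = pattern.search(html)
    if match is None:
        return None
    # Strip thousands separators such as "1,234" before converting.
    return int(match.group(1).replace(",", ""))


def parse_profile(html: str) -> Dict[str, Optional[int]]:
    """Snip age, reputation and membership length out of one profile page."""
    return {
        "age": _first_int(AGE_RE, html),  # None when the user gave no age
        "reputation": _first_int(REPUTATION_RE, html),
        "days_member": _first_int(MEMBER_DAYS_RE, html),
    }
```

A missing field simply comes back as `None`, which makes the later "treat as missing" steps straightforward.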
The script ran for around 18 hours around October 23-24, 2008.
Deleted users (those with missing user pages) were ignored, leaving 57,144 users.
Using feedback from the original post, I treated all ages of 8 and below, as well as the maximum permissible age of 88, as missing. This removed several outliers. People’s ages are self-reported, and may be wrong or biased. Just over 30,000 users had valid ages.
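The age rule is simple enough to sketch in a few lines. This is an illustration rather than the original code; the function and constant names are my own inventions, and I am assuming no profile could report an age above 88.

```python
from typing import Optional

MAX_IMPLAUSIBLE_AGE = 8   # ages 8 and below are treated as missing
MAX_PERMITTED_AGE = 88    # the largest age the site allowed; also treated as missing


def clean_age(raw_age: Optional[int]) -> Optional[int]:
    """Return the age if it looks plausible, otherwise None (missing)."""
    if raw_age is None:
        return None
    if raw_age <= MAX_IMPLAUSIBLE_AGE or raw_age >= MAX_PERMITTED_AGE:
        return None
    return raw_age
```

For example, `clean_age(30)` returns `30`, while `clean_age(8)` and `clean_age(88)` both return `None`.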
I grouped reputations into several buckets, with the bucket ranges sized on a logarithmic scale. A reputation score of 1 was put in its own bucket, as it represents the default reputation. The majority of these users are probably not active.
I then hastily constructed some Excel charts. I am using the default appearance for Excel charts; sorry for any perceived unprofessionalism, but this was a hasty, and unprofessional, exercise.
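A rough sketch of the bucketing, for illustration only. The boundaries below are an assumption: the 20-199 and 2,000+ ranges are mentioned in this post, but the 2-19 and 200-1999 ranges are guesses that fill in a factor-of-ten pattern. The names are hypothetical too.

```python
from bisect import bisect_right

# Lower bound of each bucket; each bucket after the first spans a factor of ten.
# Only "20-199" and "2000+" appear in the post; the rest are assumed.
BUCKET_LOWER_BOUNDS = [1, 2, 20, 200, 2000]
BUCKET_LABELS = ["1", "2-19", "20-199", "200-1999", "2000+"]


def reputation_bucket(reputation: int) -> str:
    """Map a reputation score to the label of its logarithmic bucket."""
    if reputation < 1:
        raise ValueError("reputation is assumed to start at 1")
    # bisect_right counts how many lower bounds are <= reputation.
    index = bisect_right(BUCKET_LOWER_BOUNDS, reputation) - 1
    return BUCKET_LABELS[index]
```

Keeping the bounds in a sorted list means changing the bucket scheme is a one-line edit, and the default score of 1 keeps a bucket of its own.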
Here are the results, presented as charts.
Number of Users By Age
The first graph reproduces the one from the original post. It shows the age of the users.
Number of Users By Reputation
This graph shows over half of the users are sitting on a reputation score of 1.
This graph shows how StackOverflow has grown in terms of members. The sharp spike almost certainly represents the growth after the site moved from Private Beta to Public Beta. You may have heard Jeff Atwood and Joel Spolsky discussing this curve on their StackOverflow podcast.
Reputation by Membership Length
This chart is a little more complex. It shows how people’s reputations are associated with the time they have been on the site.
The left part of the graph represents recent joiners. The right part of the graph represents the old-timers.
The gentle slope down to the right represents the obvious fact that the longer you have been a member, the more likely you are to have a higher reputation.
The sudden drop downwards in the middle represents the date of the public beta starting. Data to the right of the spike represent beta testers.
One way to use it is to find out how long you have been a member along the bottom, and rule a line straight up. The different coloured sections show what percentage of people have what reputation, so you can see your very rough percentile position compared to others.
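The same "rule a line straight up" reading can be done numerically. Here is a minimal sketch, assuming you have the reputations of everyone who joined around the same time as you; `percentile_rank` and its arguments are hypothetical names, not part of the original script.

```python
from typing import Sequence


def percentile_rank(reputation: int, cohort_reputations: Sequence[int]) -> float:
    """Percentage of the cohort with a strictly lower reputation."""
    if not cohort_reputations:
        raise ValueError("cohort must contain at least one user")
    below = sum(1 for other in cohort_reputations if other < reputation)
    return 100.0 * below / len(cohort_reputations)
```

With a made-up cohort of `[1, 1, 1, 1, 15, 50, 200, 3000]`, a reputation of 200 beats six of the eight users, so `percentile_rank(200, ...)` gives `75.0`.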
Reputation by Age
This is the result which originally inspired this project. It shows the relationship (or the lack of it) between age and reputation.
For this chart, I removed all ages which had fewer than 10 data points to ensure the data is meaningful. (Why 10? I used the sophisticated statistical modelling technique called “Feels about right”.)
The result seems to me to be pretty flat, within the margins of error. There doesn’t seem to be a strong correlation. (No rigorous statistical measures were harmed in the making of this assessment.)
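A sketch of that filtering step, for illustration. It assumes the chart summarises each age by its mean reputation; the post doesn't actually say which statistic is used, so treat that as a guess. The function name and the `(age, reputation)` input shape are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Optional, Tuple

MIN_POINTS_PER_AGE = 10  # the "feels about right" threshold


def mean_reputation_by_age(
    users: Iterable[Tuple[Optional[int], int]],
    min_points: int = MIN_POINTS_PER_AGE,
) -> Dict[int, float]:
    """Average reputation per age, dropping ages with too few users."""
    reputations_by_age: Dict[int, List[int]] = defaultdict(list)
    for age, reputation in users:
        if age is not None:  # users with a missing age are skipped
            reputations_by_age[age].append(reputation)
    return {
        age: mean(reputations)
        for age, reputations in sorted(reputations_by_age.items())
        if len(reputations) >= min_points
    }
```

Ages that fall below the threshold are left out of the result entirely rather than being reported as 0.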
Don’t be fooled by the dips around 16 and 54. That is just Excel’s graphing tool treating missing values as 0.
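If the chart were built in code rather than Excel, one way to avoid those false dips would be to mark the missing ages explicitly instead of letting them default to 0. A small sketch, with a hypothetical function name and input shape:

```python
from typing import Dict, List, Optional, Tuple


def series_with_gaps(value_by_age: Dict[int, float]) -> List[Tuple[int, Optional[float]]]:
    """Fill every age in the range, using None (not 0) where data is missing."""
    if not value_by_age:
        return []
    ages = range(min(value_by_age), max(value_by_age) + 1)
    return [(age, value_by_age.get(age)) for age in ages]
```

A plotting tool can then draw a gap, or skip the point, wherever the value is `None`.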
Users of StackOverflow who aren’t reluctant to share their age are mainly in their late twenties to mid-thirties, with a moderately long tail towards retirement age.
About half have a reputation of 1. Another quarter have a moderate reputation between 20 and 199. The top 1% have reputations above 2,000.
The longer you are on StackOverflow, the more likely you are to have a higher reputation. Duh! Beta testers were more likely to be active users (reputation-wise) than the general public. Double duh!
Age (within the typical ranges) doesn’t have a big influence on reputation. Outside the typical ranges, there was insufficient data to be sure.