The Financial Benefits of Your Major
Welcome to Data 88E: Economics Models! This class will explore the intersection of Data Science and Economics. Specifically, we will utilize methods and techniques in data science to examine both basic and upper-division Economics concepts. Throughout the course, we will consider a variety of economic problems both on the macro and micro level.
In the first demo of the course, we hope to give you a sense of the types of problems you can expect to explore this semester by considering a problem that may be of personal relevance to you: the post-graduate incomes of different majors at Cal.
We will be using various visualization techniques to analyze the median incomes of different majors at UC Berkeley, in addition to the median incomes of those same majors at other colleges. As a reminder, the median income is the “middle” value: if you sorted all the individual incomes of a major in ascending order, the median would be the value that’s exactly in the middle. The median is also called the 50th percentile: roughly half of the individuals have an income below the median and roughly half have an income above it.
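To make the "middle value" idea concrete, here is a minimal sketch in plain Python. The incomes are made up for illustration, and the `percentile` helper is a hypothetical stand-in that uses one common definition (the smallest value that is at least as large as p% of the values); other definitions interpolate between neighbors and can give slightly different answers.

```python
import math

def percentile(p, values):
    """Smallest value that is at least as large as p% of the values."""
    ordered = sorted(values)
    # 1-based position of the requested percentile in the sorted list.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical incomes for five graduates of one major (not real data).
incomes = [38000, 52000, 45000, 41000, 60000]

median_income = percentile(50, incomes)   # the middle value once sorted
lower_quartile = percentile(25, incomes)  # 25th percentile
upper_quartile = percentile(75, incomes)  # 75th percentile
print(lower_quartile, median_income, upper_quartile)
```

With an odd number of incomes the median is simply the middle entry of the sorted list. With an even number, this particular definition returns the lower of the two middle entries, while some tools average them instead.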
Do not be concerned if you don’t understand the code below: this entire exercise is purely a demo to motivate many profound concepts in Economics. If you’re interested, you may choose to come back to this demo at the end of the course and consider all the different techniques utilized in it - it’d be a great way of reflecting upon how much you’ve learnt!
Before we can use data science to tackle any issue, we must, well, obtain data (kind of mind-boggling, I know). Recall that we want to examine the median incomes of different majors at UC Berkeley as well as the median incomes of those same majors at other colleges. The term ‘other colleges’ is a fairly general one, and in this case we shall consider the average median incomes of those majors at all other colleges in the United States.
In order to obtain a dataset, you can either collect it yourself (via surveys, questionnaires, etc.) or you can use datasets that others have gathered for you. In this demo, we are combining 3 different datasets:
The median income for each major at Cal was obtained from Cal’s 2019 First Destination survey.
The median income for each major overall was obtained from the American Community Survey (ACS) from 2010 to 2012, a very popular data source for Economics research! In the survey, ACS essentially contacted college graduates and asked them their income as well as what they majored in at college. (As a side note, FiveThirtyEight later published an article using the results of the survey.) In this project, we will be using a modified version of the ACS data: we will only be looking at the respondents who are 28 or younger. Can you think of why we would do this?
The longitudinal data on long-run outcomes of UC Berkeley alumni was obtained from the University of California webpage. We will use this dataset later for a slightly different analysis.
Take a moment to consider the ways in which the 3 different datasets were created. Is it fair to draw direct comparisons between the datasets? What would be some potential issues and how could the differences in our datasets affect our analysis?
Mean vs Median
Before proceeding further, it is important to consider why we are choosing to look at the median, and not the average, income. In order to answer this question, let us think about what the distribution of incomes for a population would look like. Most likely, you would see a large number of incomes around or slightly below the mean, with a few massive outlier incomes far above the mean. For example, consider a theatre major who becomes a star on Broadway: while they’d be doing absolutely fantastic in their career, they are not representative of the typical theatre graduate from Berkeley and would likely pull the average income way up. For this reason, using the median is more robust: it gives us a better idea of what the typical graduate for any major can generally expect to earn.
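The Broadway example can be sketched with a few lines of Python. The numbers below are invented purely for illustration; they are not drawn from any of the datasets in this demo.

```python
from statistics import mean, median

# Hypothetical incomes for a small group of theatre graduates (not real data).
typical = [32000, 35000, 38000, 40000, 45000]

# The same group plus one Broadway star with a very large income.
with_star = typical + [2000000]

print("mean without star:  ", mean(typical))
print("median without star:", median(typical))
print("mean with star:     ", mean(with_star))
print("median with star:   ", median(with_star))
```

In this made-up group, adding a single star moves the mean from 38,000 to 365,000, while the median only moves from 38,000 to 39,000. That is the sense in which the median is robust to outliers.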
Now we’ll load in all the data.
Take a look at the tables for each dataset.
P25th refers to the 25th percentile of incomes (the income level at which 25% of incomes are lower) and P75th refers to the 75th percentile of incomes (the income level at which 75% of incomes are lower).
You may not know what all the different columns in the tables mean. That’s okay!
As data scientists, we often encounter a lot of irrelevant data that we will discard later.
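from datascience import Table  # Table class from the Data 8 datascience library

# Load in table of all majors' median incomes at Cal
cal_income = Table.read_table("cal_income.csv")
cal_income.show(10)

| Major | Cal Median | Cal P25th | Cal P75th |
|---|---|---|---|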
... (39 rows omitted)
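# Load in table of all other universities' average major median incomes
other_income = Table.read_table("recent-grads.csv")
other_income.show(10)

| Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | Full_time | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2416 | MINING AND MINERAL ENGINEERING | 756 | 679 | 77 | Engineering | 0.101852 | 7 | 640 | 556 | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
| 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258 | 1123 | 135 | Engineering | 0.107313 | 16 | 758 | 1069 | 150 | 692 | 40 | 0.0501253 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
| 8 | 5001 | ASTRONOMY AND ASTROPHYSICS | 1792 | 832 | 960 | Physical Sciences | 0.535714 | 10 | 1526 | 1085 | 553 | 827 | 33 | 0.0211674 | 62000 | 31500 | 109000 | 972 | 500 | 220 |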
... (163 rows omitted)
To make direct comparisons across majors, we combined all the tables above into a single one for us to use below.
majors = Table.read_table("cal_vs_all.csv")
majors.show(10)

| Index | Major | Major Category | Median Income Difference | Cal P25th | Cal Median | Cal P75th | Overall P25th | Overall Median | Overall P75th |
|---|---|---|---|---|---|---|---|---|---|
| 1 | American Studies | Humanities & Liberal Arts | 15000 | 41600 | 55000 | 60000 | 30000 | 40000 | 42000 |
| 2 | Anthropology | Humanities & Liberal Arts | 13600 | 36500 | 41600 | 51000 | 20000 | 28000 | 38000 |
| 3 | Applied Mathematics | Computers & Mathematics | 35004 | 65000 | 80004 | 108000 | 34000 | 45000 | 63000 |
| 8 | Chemical Biology | Biology & Life Science | 12520 | 44000 | 49920 | 68000 | 29000 | 37400 | 50000 |
... (39 rows omitted)
The combined table above drops the columns from the earlier tables that we don’t need for our exploration.
It also adds a new column, Median Income Difference, which is the Berkeley median income minus the overall median income for each major.
It gives us a sense of the value of Cal over the average university: the difference is the additional income we receive from obtaining a Cal degree.
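As a quick check of how this column is built, the sketch below recomputes the difference for the four rows visible in the combined table. The values are copied from that table; the variable names are only for illustration.

```python
# Rows copied from the combined table: (major, Cal median, overall median).
rows = [
    ("American Studies", 55000, 40000),
    ("Anthropology", 41600, 28000),
    ("Applied Mathematics", 80004, 45000),
    ("Chemical Biology", 49920, 37400),
]

# Median Income Difference = Cal median minus overall median, per major.
differences = {major: cal - overall for major, cal, overall in rows}

for major, diff in differences.items():
    print(f"{major}: {diff}")
```

The results match the Median Income Difference column for those rows. With the `majors` table loaded, the same subtraction could presumably be written as `majors.column("Cal Median") - majors.column("Overall Median")`, though that is an assumption about how the column was produced.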
Before moving forward, take a second to consider how well the above tables would match with each other.
For example, Electrical Engineering and Computer Science (EECS) is a popular major at Berkeley. However, the overall majors dataset didn’t have a direct equivalent for it: it listed Electrical Engineering, Electrical Engineering Technologies and Computer Engineering as separate majors.
Since in theory students in EECS focus more on computer engineering, we chose to use the computer engineering data for drawing comparisons in our final, combined table.
However, there’s room for ambiguity here and that is another potential flaw in our exploration!
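One way to picture this matching step is as a small lookup from Cal major names to names in the overall dataset. The sketch below is hypothetical: the actual matching procedure is not shown in this demo, and `CAL_TO_OVERALL` and `overall_name` are illustrative names. Only the EECS entry reflects the choice described above.

```python
# Hypothetical manual overrides for Cal majors with no direct equivalent.
# Only the EECS choice comes from the text; the structure is an assumption.
CAL_TO_OVERALL = {
    "Electrical Engineering and Computer Science": "COMPUTER ENGINEERING",
}

def overall_name(cal_major):
    """Return the overall-dataset major name used to compare with a Cal major."""
    # Use the manual override if one exists; otherwise assume the names match
    # once upper-cased, since the overall dataset lists majors in capitals.
    return CAL_TO_OVERALL.get(cal_major, cal_major.upper())

print(overall_name("Electrical Engineering and Computer Science"))
print(overall_name("Anthropology"))
```

Writing the overrides down explicitly keeps each judgment call visible, which makes it easier to revisit an ambiguous match like EECS later.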
The graph below displays the median salaries for all the majors in our dataset side by side. Feel free to look at the values for a few seconds. Do you find anything interesting?