PhDs Awarded in the US in 2017
The National Center for Science and Engineering Statistics (NCSES) of the National Science Foundation (NSF) is a statistical agency that specializes in the collection and analysis of data related to science and engineering. NCSES designs and administers national surveys to support research that uses its data. One of its datasets is Science & Engineering Doctorates, a collection of tables that details the demographic characteristics, educational history, sources of financial support, and postgraduation plans of doctorate recipients. In this blog, I go through the creation of a dashboard that showcases PhDs awarded in 2017 by state and field of study (Table 7).
The original data contains the total number of doctorates granted, broken down by institution and field of study. However, institution-level data are too granular and hard to visualize in a choropleth map. Therefore, I use the provided state-level sums as the unit of analysis. For the field of study, the data includes 3 categories, ‘All fields’, ‘Science’ and ‘Engineering’, where the latter 2 are further broken down into subfields. For example, subfields such as ‘Chemistry’ and ‘Mathematics and statistics’ are part of ‘Science’, whereas ‘Bioengineering and biomedical’ and ‘Mechanical’ are part of ‘Engineering’.
Every visualization project should start by defining the objective: what information is the visualization trying to convey? For this project, I’m interested in learning which state in the U.S. granted the highest number of PhDs in 2017. It would also be useful to learn which fields are the ‘hottest’, which could help younger generations better navigate higher education.
As previously mentioned, the unit of analysis is the state, which means I want to remove all institution-level rows. A brief inspection shows that institution-level rows have abbreviations such as ‘U.’ and ‘C.’ in their index labels, which stand for university and college respectively. This allows me to drop those rows using regex patterns.
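As a rough illustration of the regex-based filtering described above, here is a minimal sketch with pandas. The table, the row labels and the exact pattern are my own assumptions for illustration and are not taken from the original cleaning function.

```python
import pandas as pd

# Toy stand-in for the NCSES table: the index mixes state rows and
# institution rows. All names and counts here are made up.
raw = pd.DataFrame(
    {"All fields": [120, 80, 40, 90, 90]},
    index=["Alabama", "Auburn U.", "Example C.", "Alaska", "U. Alaska, Fairbanks"],
)

# Assumed pattern: a standalone "U." or "C." anywhere in the row label.
institution_pattern = r"\b[UC]\."

# Boolean mask of rows whose label looks like an institution name.
is_institution = raw.index.str.contains(institution_pattern, regex=True)

# Keep only the rows that do not match, i.e. the state-level rows.
states_only = raw[~is_institution]
print(states_only.index.tolist())
```

One caveat with a pattern like this: a label such as ‘Washington, D.C.’ would also match, so it is worth checking the list of dropped rows before trusting the result.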
To answer the question of which fields are the hottest, I want to look at the distribution of fields within each state, which should tell me whether more people study Science or Engineering. The ‘All fields’ column also includes fields besides science and engineering. Therefore, I create a column that shows the count for fields that are neither science nor engineering.
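The extra column can be derived by subtraction. The sketch below assumes the three category columns are named ‘All fields’, ‘Science’ and ‘Engineering’ as in the text; the new column name ‘Other’ and the counts are hypothetical.

```python
import pandas as pd

# Hypothetical state-level totals; the numbers are made up.
states = pd.DataFrame(
    {"All fields": [700, 50], "Science": [400, 30], "Engineering": [200, 10]},
    index=["Alabama", "Alaska"],
)

# Whatever remains after removing science and engineering belongs to
# fields outside both categories.
states["Other"] = states["All fields"] - states["Science"] - states["Engineering"]
print(states)
```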
Finally, instead of the full state names used in the original dataset, Plotly needs the 2-letter abbreviation of each state to map the values on the choropleth graph. Therefore, I found another dataset that contains both the full name and the 2-letter code of each US state, which I join with the original table from NCSES.
The following snippet is a function that contains all the data cleaning I did, including renaming some columns. This function returns 3 data frames, which represent all fields, science and engineering.
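A join like this might look as follows. This is only a sketch: the lookup table, the ‘State’ key column and the counts are assumptions; only the ‘Code’ and ‘Total’ column names appear later in this post.

```python
import pandas as pd

# Hypothetical state-level counts keyed by full state name.
states = pd.DataFrame({"State": ["Alabama", "Alaska"], "Total": [700, 50]})

# Hypothetical lookup table with full names and 2-letter codes.
codes = pd.DataFrame(
    {"State": ["Alabama", "Alaska", "Arizona"], "Code": ["AL", "AK", "AZ"]}
)

# A left join keeps every row of the NCSES table; validate raises an
# error if a state name is duplicated on either side.
merged = states.merge(codes, on="State", how="left", validate="one_to_one")

# Rows without a code point to a spelling mismatch between the two tables.
missing = merged[merged["Code"].isna()]
print(merged)
```

Checking `missing` after the join is a cheap way to catch names that are spelled differently in the two datasets.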
To learn how the popularity of each field (and its subfields) varies across states, a choropleth graph is the ideal way to convey the message. More concretely, the choropleth gradient can be used to color-code each state by the total count of doctorates granted. In addition, if one would like to see the detailed distribution of all subfields within a field for a selected state, a bar chart can be used, where the bar height encodes the count for a subfield. Together, these 2 graphs allow users to explore field popularity for each state at different levels of granularity.
Finally, I also want to give users the option to focus on a single field. This makes sense because science and engineering are two separate categories. It also makes the bar graph more interpretable, as the bars are less crowded. To achieve this, I use radio items, which let users choose between science only, engineering only or total fields (including non-science and engineering fields). The choropleth and bar graphs then work interactively with the radio items. More specifically, the graphs change and present different information based on the user’s selection, as shown in the following gif:
There are two essential components in building a Dash dashboard: the layout and the callbacks. The layout specifies how the different figure components are put together in an HTML-style script, whereas the callbacks specify how the components interact with each other.
To get started with layout, I first instantiate my choropleth and bar plot objects with Plotly Express. This is shown in the following snippet:
Note that for the choropleth, the scope is set to ‘usa’ to show the United States only. In order to map the states to their corresponding counts, I set locationmode to ‘USA-states’ and locations to ‘Code’, which is the column that stores the 2-letter abbreviation of each state in my dataset. Finally, I set color to ‘Total’, which is the column that stores the count of doctorates granted.
For the barplot, I set x to ‘Field’, which is the column that contains all subfield names, and y to ‘Count’, which is the doctorate count for each subfield. I have also used the update_layout method, which takes a series of key-value pairs, to remove the background color and the legend associated with the barplot.
Next, these figures are passed to the appropriate dcc (Dash Core Components) component in the desired sequence in app.layout. In addition, I have to assign a unique ‘id’ to each component so that it can be referred to in the callback stage.
Since the radio items are not a graph, they can be constructed directly with dcc.RadioItems, where a pair of ‘label’ and ‘value’ needs to be supplied to the options parameter for each item.
The code for layout is shown in the following snippet:
A callback starts with the decorator @app.callback(), where we define the Output and Input inside the parentheses. Both Output and Input accept a component_id, which is the unique ID we previously assigned to each component, and a component_property, which is a dynamic property of the component. For example, for radio items, the component_property I want to use is ‘value’, which yields the item that the user selects. If there is more than one Input or Output, they should be kept inside a pair of square brackets as a list.
The decorator is always followed by a function, whose name is arbitrary. The parameter to the function receives the component property of the Input. Therefore, the number of parameters should match the number of Inputs in the decorator. The function should perform some operations, such as updating the choropleth figure based on the input, and return one or more values that correspond to the Outputs in the decorator.
In my case, two callbacks are used. The first callback updates the choropleth based on what is selected in the radio items. This effectively presents a different choropleth for ‘Total’, ‘Science Only’ and ‘Engineering Only’. The second callback updates the barplot based on the radio items, as well as the state that the cursor hovers over in the choropleth. The combination of these two inputs produces a barplot that displays the distribution of subfields for a particular state. These two callbacks are shown in the following snippet.
Based on the constructed dashboard, we can see that California awarded the most doctorate degrees in 2017, followed by Texas and New York. Among all doctorates, science is clearly the most popular field. Within the Science field, Biological and biomedical sciences is the most common subfield in most states. When it comes to engineering, Electrical, electronics, and communications is the dominant subfield in most states.
The complete implementation can be found at the following link. The dashboard can be recreated by running python application.py in a terminal.