Semi-supervised graph labelling reveals increasing partisanship in the United States Congress

Max Glonek, University of Adelaide
10 June, 2019

Prerequisites:
* R/RStudio
* Matlab
* R packages:
   - tidyverse
   - stringr
   - igraph
   - Rvoteview (see here for installation instructions: https://github.com/voteview/Rvoteview)
   
Packages in R can be installed using the R command: install.packages("package_name")
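
As a convenience, the check below lists which of the prerequisite packages are not yet installed. It is a minimal sketch: the helper name missing_packages is hypothetical and is not part of the project files. Rvoteview is excluded from the install call on the assumption that it must be installed by following the instructions linked above.

```r
# Packages listed in the prerequisites above.
required <- c("tidyverse", "stringr", "igraph", "Rvoteview")

# Hypothetical helper: return the packages that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!unname(installed)]
}

# Rvoteview is handled separately (see the link above).
to_install <- setdiff(missing_packages(required), "Rvoteview")
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```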

Instructions:
1. Create a parent directory.
2. Within this parent directory, create the following subdirectories:
   - "comp data"
   - "final data"
   - "out data"
   - "raw data"
   - "source data"
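
The subdirectories can also be created from R rather than by hand. The following is a sketch only; the function name create_subdirs is hypothetical, and the parent path in the usage comment is the example path used elsewhere in these instructions.

```r
# Subdirectory names exactly as listed above.
subdirs <- c("comp data", "final data", "out data", "raw data", "source data")

# Hypothetical helper: create each subdirectory under `parent`
# (existing directories are left untouched).
create_subdirs <- function(parent) {
  paths <- file.path(parent, subdirs)
  for (p in paths) {
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dir.exists(paths))
}

# Usage (replace with your own parent directory):
# create_subdirs("C:/Users/Max/GLaSS")
```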
3. Place the following files in the parent directory:
   - 001_get_data.R
   - 003_get_leaders.R
   - 004_clean_data.R
   - 005_build_network.R
   - 007_evaluate_results.R
   - comp_rule_5a.R
   - comp_rule_5b.m
   - comp_rule_5c.R
   - control.csv
   - markov_chain_006.m
   - roles.csv
   - rule_1_2_clean.R
In the designated area at the top of each R script file (*.R), specify your parent directory as the working directory. For example, on a Windows system, if your parent directory is "C:\Users\Max\GLaSS", use the R command: setwd("C:/Users/Max/GLaSS")
4. First, run "001_get_data.R". This file downloads member information and party information for all 42 Houses and all 42 Senates considered in this project. Files are saved in the "raw data" subdirectory.
5. Member vote and roll call vote information must be downloaded from the Voteview website (https://voteview.com/data). To download member vote information, use the following settings:
   - Data Type: "Members' Votes"
   - Chamber: "House Only" (for House data) or "Senate Only" (for Senate data); House and Senate data must be downloaded separately
   - Congress: Individually from "74th (1935-1937)" to "115th (2017-2019)", inclusive
   - File Format: "CSV (Recommended)"
To download roll call vote information, use the following settings:
   - Data Type: "Congressional Votes"
   - Chamber: "House Only" (for House data) or "Senate Only" (for Senate data); House and Senate data must be downloaded separately
   - Congress: Individually from "74th (1935-1937)" to "115th (2017-2019)", inclusive
   - File Format: "CSV (Recommended)"
For example, member vote information for the 74th House will be contained in "H074_votes.csv", and roll call vote information for the 85th Senate will be contained in "S085_rollcalls.csv".
All downloaded files should be stored in the "raw data" subdirectory.
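
Because this step involves a large number of manual downloads, it can help to check that none are missing before continuing. The sketch below assumes every file follows the naming pattern of the two examples above (chamber letter, three-digit congress number, then "_votes.csv" or "_rollcalls.csv"). The function names are hypothetical and are not part of the project files.

```r
# Build the file names expected for the 74th to 115th congresses,
# following the pattern of "H074_votes.csv" and "S085_rollcalls.csv".
expected_files <- function(congresses = 74:115) {
  grid <- expand.grid(
    type = c("votes", "rollcalls"),
    congress = congresses,
    chamber = c("H", "S"),
    stringsAsFactors = FALSE
  )
  sprintf("%s%03d_%s.csv", grid$chamber, grid$congress, grid$type)
}

# List the expected files that are not present in the raw data folder.
missing_raw_files <- function(raw_dir = "raw data") {
  setdiff(expected_files(), list.files(raw_dir))
}

# Usage, with the parent directory as the working directory:
# missing_raw_files("raw data")
```

An empty result means that all 168 expected files (42 congresses, 2 chambers, 2 data types) are present.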
6. Run "003_get_leaders.R". This file determines, from a master list in "roles.csv", the icpsr (member ID) of the Democrat and Republican leader for every vote in every Congress. All output is saved in the "source data" subdirectory.
7. Run "004_clean_data.R". This file prepares various datasets for further analysis by applying the cleaning rules described in the paper and removing extraneous variables for each House and Senate in the study. All output is saved in the "source data" subdirectory.
8. Run "005_build_network.R". This file builds, from the source material in "source data", a graph of each House and Senate in the study. For each graph, the weighted adjacency matrix and a list of node names/IDs are produced and saved in the "out data" subdirectory.
9. Run "markov_chain_006.m". This file exactly calculates absorption probabilities and expected times to absorption for each graph (based on the weighted adjacency matrices in "out data"). All output is saved in the "final data" subdirectory.
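
The calculation itself is performed in Matlab by "markov_chain_006.m". The R sketch below only illustrates the standard absorbing Markov chain formulas on a toy graph and is not the project's code. It assumes a random walk whose transition probabilities are proportional to the edge weights, with two absorbing nodes. The function name absorption and the 4-node matrix W are hypothetical.

```r
# Hypothetical helper: absorption probabilities and expected times to
# absorption for a random walk on a weighted graph.
#   W         weighted adjacency matrix
#   absorbing indices of the absorbing nodes
absorption <- function(W, absorbing) {
  transient <- setdiff(seq_len(nrow(W)), absorbing)
  P <- W / rowSums(W)                        # row-normalise the weights
  Q <- P[transient, transient, drop = FALSE] # transient -> transient
  R <- P[transient, absorbing, drop = FALSE] # transient -> absorbing
  N <- solve(diag(length(transient)) - Q)    # fundamental matrix (I - Q)^-1
  list(
    prob = N %*% R,                                       # absorption probabilities
    time = as.vector(N %*% rep(1, length(transient)))     # expected steps to absorption
  )
}

# Toy graph: nodes 1 and 2 are absorbing, nodes 3 and 4 are transient.
W <- matrix(0, 4, 4)
W[1, 3] <- W[3, 1] <- 3
W[2, 4] <- W[4, 2] <- 3
W[3, 4] <- W[4, 3] <- 1

res <- absorption(W, absorbing = c(1, 2))
print(res$prob)  # rows: nodes 3 and 4; columns: absorbing nodes 1 and 2
print(res$time)
```

In this toy graph, node 3 is absorbed at node 1 with probability 0.8 and node 4 is absorbed at node 2 with probability 0.8, and both have an expected time to absorption of 4/3 steps.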
10. Run "007_evaluate_results.R". This file analyses the absorption probabilities and estimates labels for all nodes in each graph. Estimated labels are compared to real labels, and the F1 score is calculated. Summary details for members are saved in the "final data" subdirectory. Summary details for all Houses are saved in the parent directory as "H_stats.csv", while details for all Senates are saved as "S_stats.csv". "Standardised differences" for every House and Senate are calculated and saved in the parent directory as "H_diff.csv" and "S_diff.csv", respectively. This file also fits linear models to explore the impact of party control of the House, Senate, and Presidency on partisanship in the House and Senate.
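
To illustrate the comparison of estimated and real labels, the sketch below computes an F1 score on made-up data. It is not taken from "007_evaluate_results.R". It assumes that each node receives the label of the party with the larger absorption probability and that one party is treated as the positive class. The function name f1_score and all values are hypothetical.

```r
# Hypothetical helper: F1 score for one class treated as "positive".
f1_score <- function(truth, estimate, positive) {
  tp <- sum(estimate == positive & truth == positive)  # true positives
  fp <- sum(estimate == positive & truth != positive)  # false positives
  fn <- sum(estimate != positive & truth == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# Made-up absorption probabilities at the Democrat leader for five members.
prob_D <- c(0.9, 0.7, 0.4, 0.2, 0.6)
truth <- c("D", "D", "D", "R", "R")

# Assumed labelling rule: choose the party with the larger probability.
estimate <- ifelse(prob_D > 0.5, "D", "R")

print(f1_score(truth, estimate, positive = "D"))
```

Here two of the three real Democrats are labelled correctly and one Republican is mislabelled as a Democrat, so precision and recall are both 2/3 and the F1 score is 2/3.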
Optionally:
11. Run "rule_1_2_clean.R". This file measures the impact of successively applying data cleaning rules 1 and 2. Output is saved in the parent directory.
12. Run "comp_rule_5a.R", THEN "comp_rule_5b.m", THEN "comp_rule_5c.R". These files repeat analyses for the 115th House and 90th Senate, but without applying data cleaning rule 5 to the graphs. Output is saved in the "comp data" subdirectory.