Merging two data sets.
Today I spent some more time working on the data science boot camp project. I read on the Kaggle page the hash
column did give identifiers for the survey respondents. So I spent time working on merging the data between the two files.
I wanted to start a new Jupyter notebook for the data cleaning, but I didn’t want to rerun the code to translate the columns of the data files from Spanish to English again. Turns out it wasn’t hard to figure out how to create new data frames with the translated columns and save them as .csv
files. I also looked for a way to read both data sets at once but the only code I could find was to simultaneously read and merge multiple data files, so right now my code has me reading the files one at a time.
I wanted to merge only the data for the respondents who did both surveys, using their ids given in the hash
column. The first thing I did was write a loop that would list all of the hash
entries that appeared in both data files, then give the number of entries. When I called the len
function I put the name of the list in quotes and didn’t realize that would give me the length of the string that was the name of the list I made in the loop. After I fixed the problem I saw there were 7090 respondents of both surveys. The original source of the data on Zenodo said there were 5618 respondents who took both surveys and met “inclusion criteria”, so I guess there might be some more data cleaning to do.
It took a lot of Googling around to find the command that would merge the data files using the common hash
ids. I knew I had seen it done before in one of the Erdös Institute videos, but I couldn’t remember which video I’d seen it in. Eventually I found the merge
command from Google searches but I didn’t fill in the arguments correctly. There is one argument called how
and I set it to outer
. The new data frame I created from merging the files contained all the data from both data files, about 160,000 rows of data. It took me a minute, but eventually I found that I needed to put inner
in that argument, and it gave me what I wanted.
The next thing I want to do is rename the first 42 columns of the first data set. Right now these columns are named as numbers and there is a table on the Kaggle site that has the description of each column. So right now I am working on scraping that table so I can set up a loop that will rename those columns for me. I think after that there might be a little more cleaning I want to do, for example, when I merged the two files I got a couple of extra columns with the indices of the rows from each data file. Then some of the columns from the Spanish data set don’t have very descriptive names, even with the translation, and there is no table on Kaggle with those column descriptions. I am thinking about deleting those columns. Right now my merged data frame has 197 columns so I need to find some way to make the data more wieldly.