Machine learning: Optimizing loss functions II.

February 3, 2025

Loss functions are used to train neural networks, but really, they arise in all machine learning algorithms. The last post talked about the MSE function for linear regression.
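As a quick refresher on the MSE mentioned here, this is a minimal sketch in plain Python. The function name `mse` and the toy numbers are made up for illustration; they are not taken from the earlier post.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (made up): actual values vs. predictions from some fitted line
actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]

# Residuals are 0.5, 0.0 and -1.0, so the result is (0.25 + 0 + 1) / 3
print(mse(actual, predicted))
```

Squaring the residuals means large misses are penalized much more heavily than small ones.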

Machine learning: Optimizing loss functions I.

January 26, 2025

I recently finished a Kaggle competition where I had to predict housing prices, given 79 features. I tested several models, including linear and polynomial regression, some tree-based algorithms, and most interestingly, a neural network using PyTorch. I learned a lot about neural networks in the process, and I wanted to touch on one aspect in these next two posts, loss functions.

Two-pointer technique II.

January 3, 2025

See the last post for an introduction. Here are two more applications of the two-pointer technique.

Two-pointer technique I.

January 2, 2025

Been a while! After the election forecast I started a Kaggle competition that I am looking to wrap up pretty soon. But in the meantime I got a job interview, and it’s a coding interview. I scheduled it for the first or second week of February so I would have time to prepare. A friend of mine at Google said he used LeetCode to prepare for his interview, so that’s where I started. I went ahead and signed up for Premium because I wanted half off on courses. Then I found an Explore Card track for data structures, which I believe comes with Premium already, and began with that. Turned out to be more challenging than I expected! It took over 6 hours to get through the first card, and with 19 cards, I knew I wouldn’t be able to get through them all in time. By the second card I was completely lost and suddenly my preparation felt unproductive. Then my friend told me to look at the interview materials on LeetCode and I found this course in data structures that claimed to be an efficient way to prepare for a coding interview. It is! And something that came up in the course that had been touched on in the Explore Cards, too, is what’s known as the two-pointer technique.
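For anyone who hasn't seen the two-pointer technique, here is a small sketch of one standard use: checking whether a sorted list contains two numbers that add up to a target. The function name and the sample list are my own choices for illustration and are not taken from the course.

```python
def has_pair_with_sum(sorted_nums, target):
    """Return True if two entries of a sorted list add up to target."""
    left, right = 0, len(sorted_nums) - 1
    while left < right:
        current = sorted_nums[left] + sorted_nums[right]
        if current == target:
            return True
        if current < target:
            left += 1   # sum too small, so move the left pointer up
        else:
            right -= 1  # sum too big, so move the right pointer down
    return False


print(has_pair_with_sum([1, 2, 4, 7, 11], 13))  # True, since 2 + 11 = 13
print(has_pair_with_sum([1, 2, 4, 7, 11], 10))  # False
```

Because each step moves one of the two pointers inward, the loop makes a single pass over the list instead of checking every pair.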

Machine learning: MASE.

October 28, 2024

I recently completed a forecast project for the 2024 U.S. presidential election that used a double exponential smoothing algorithm. In developing the model, I used a grid search with the metric MASE, which stands for mean absolute scaled error.
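Here is a minimal sketch of how MASE can be computed, assuming the usual non-seasonal definition: the forecast's mean absolute error divided by the in-sample mean absolute error of the naive "previous value" forecast. The function name and the numbers are made up and are not the election data.

```python
def mase(train, actual, forecast):
    """Mean absolute scaled error for a non-seasonal series."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    # Mean absolute error of the forecast on the held-out values
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    # Mean absolute error of the one-step naive forecast on the training data
    naive_mae = sum(
        abs(train[i] - train[i - 1]) for i in range(1, len(train))
    ) / (len(train) - 1)
    if naive_mae == 0:
        raise ValueError("training series is constant, so MASE is undefined")
    return forecast_mae / naive_mae


train = [10.0, 12.0, 11.0, 13.0]   # made-up training series
actual = [14.0, 15.0]              # made-up held-out values
forecast = [13.0, 15.5]            # made-up forecasts

print(mase(train, actual, forecast))  # roughly 0.45
```

A value below 1 means the forecast's average error is smaller than the naive forecast's average in-sample error, which is what makes MASE convenient for comparing models in a grid search.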

Access codes and Python.

October 4, 2024

As promised, a post about the access code I had to enter for this semester’s data science boot camp.

Localization and dependencies.

September 27, 2024

I haven’t finished the SQL course yet, but I’ve learned about everything on the certification exam now except creating tables, which I already kind of know how to do, anyway. Recently got excited about Erdős Institute’s Software Engineering for Data Scientists asynchronous course and I’m sort of participating in this fall’s DS boot camp, too.

The ternary operator.

September 4, 2024

When I first bought my JavaScript book (JavaScript: The Definitive Guide by David Flanagan), I read like the first ten chapters of it at once. At one point, I read about something called the ternary operator, given by ?:. At the time I didn’t understand exactly what its use was, but I’ve since seen it again and decided to look into it.

Learning SQL.

August 27, 2024

I’m doing a SQL course on Udemy! It’s the one that’s ranked really high. At first I thought it was too slow, but I bought Oracle’s SQL book, which the instructor of the course recommended, and it turns out the videos I’ve watched so far cover like 4 chapters of the book already. That, along with the interview prep exercises I found on LeetCode, should be plenty to make me feel prepared for the exam. In this post I’d like to talk about some neat stuff I’m learning.

Imputing in pandas with `.groupby`.

August 13, 2024

Finally, another post about pandas (and Tableau)! Last time, just like the time before that when I made multiple posts in one day, I had to add timestamps to the YAML front matter to get the posts to display in the correct order. It wasn’t working for my last two posts until I changed the times to earlier ones. Turns out the time I write is the local time, so even if I write date: 2024-07-24 18:00:00 -06:00, which is 6 p.m. Mountain Time, it doesn’t work if 6 p.m. MT hasn’t occurred yet. Confusing, but I guess it works now.
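To make the timestamp issue concrete, here is a small Python sketch that parses a front-matter style date and checks whether it is still in the future. It only illustrates the behavior I ran into; the helper name `is_in_future` is hypothetical, and I am assuming (without having confirmed it in the Jekyll docs) that a post with a future timestamp is what gets held back.

```python
from datetime import datetime, timezone

# Matches strings like "2024-07-24 18:00:00 -06:00" (Python 3.7+)
FRONT_MATTER_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def is_in_future(date_string, now):
    """True if the front-matter timestamp is later than `now`."""
    post_time = datetime.strptime(date_string, FRONT_MATTER_FORMAT)
    return post_time > now


stamp = "2024-07-24 18:00:00 -06:00"

# The -06:00 offset means 6 p.m. in that zone, i.e. midnight UTC the next day
print(datetime.strptime(stamp, FRONT_MATTER_FORMAT).astimezone(timezone.utc))

# Pretend it is 3 p.m. in the same zone: the post's time has not happened yet
three_pm = datetime.strptime("2024-07-24 15:00:00 -06:00", FRONT_MATTER_FORMAT)
print(is_in_future(stamp, three_pm))  # True
```

Comparing timezone-aware datetimes like this works across offsets, so the same check holds no matter which zone the build machine uses.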

Tableau challenges with pandas.

July 24, 2024

As I mentioned in the last post, right now I am working on a Tableau project with polling data from fivethirtyeight. Overall I am looking to make a time series line chart with the poll results for each candidate over time (I feel like that’s the big story, later I can add filters like how the results change depending on who’s polled, who does the polling, etc.). I’m going by the end date for each poll. Here are some challenges:

Learning about pandas.

July 24, 2024

I’ve taken a break from the Node to SQL project to get a few quicker projects on my resumé. The current one is a data visualization in Tableau. I downloaded some polling data from fivethirtyeight that gives presidential polling results dating from 7 April 2021 to 18 July 2024. I may do a post on Tableau later but for now I’d like to do one on pandas, since it’s been my main method of cleaning the data.

Node to MySQL.

July 11, 2024

I spent all of this week watching videos to learn how to access MySQL via Node. Looked at 3 videos, and this one by Nsquared Coding seemed to be the best. For some reason with the other videos I tried following along and got mostly permissions errors when I tried to connect to the local server using JavaScript commands. The Nsquared Coding one not only had complete instructions for a full stack project, but it was a little different in that it used XAMPP to connect to the MySQL port and make the database.

Creating a server with Node.

July 4, 2024

Here’s the video I’m using to make a basic server.

Updates and Node.

July 3, 2024

OK, I see that if I make more than one blog post in the same day Jekyll posts them in descending alphabetical order by title. So this post was meant to go before Creating a server with Node. Update (11 Jul 2024): Turns out I can just add a timestamp to get the multiple posts on the same day in the right order!

Been a year.

May 17, 2024

Now that I’m wrapping up my contract it’s time to get back to professional development (this job kept me BUSY for the past year!). I decided my first priority would be to find a way to wrap up that 538 project. It’s a shame that I worked so hard on it, and I do have something to show for it, but I haven’t managed to put it on a resumé or anything. Today I had to refamiliarize myself with everything, it seemed.

Opened up Jupyter. Turns out my ProblemStatment file alone is self-contained and impressive enough. It’s the intro to the Erdős Institute Data Science Boot Camp project I tried to do last spring. The code scrapes metadata from the features pages on fivethirtyeight.com. The highlight of the code is the function I wrote to scrape the number of comments. I have a folder with output files, and the most recent one has the metadata from 1026 posts. Nice and neat, so I figure there must be a simple way to present this metadata on a webpage.

Git for Windows and command lines.

January 23, 2024

Been a while! Last semester was way busier than I was anticipating, but I’m trying to make a better time management plan for this spring.

Fixed some issues.

May 6, 2023

The other night I finally figured out what was wrong with the pictures on this blog! They weren’t loading in my posts, and then recently my avatar stopped loading, too. I had been meaning/trying to fix this for a long time. It turned out to be pretty easy to fix.

My first data challenge.

May 4, 2023

As part of a technical interview workshop with Erdős Institute, this morning I did a data challenge to present at the problem session tomorrow. I gave myself 3 hours to do it, and barely got through cleaning the data. Toward the end of the 3 hours, I realized and learned some things I could’ve done to get further along, but I just ran out of time. But it was interesting putting my Python abilities to the test in a timed setting. I don’t think I know Python as well as I thought I did. Or maybe I already had a realistic idea of how proficient I am – my application for the boot camp TA position was near perfect, in my opinion, but it also took 6 hours to complete (and Matt doesn’t know that). Now I want to try more data challenges to see if I can get better.

Personal website.

May 1, 2023

This past weekend I decided to make a personal website. Whenever I have a professional website, if I switch jobs then I lose the site. I wanted to have a permanent place with all of my materials. I used GitHub Pages to host the site.

Subdirectory.

April 27, 2023

I am trying to change the url of my blog but I can’t figure it out. I followed the directions for Solution 2 on this page but it didn’t work. I also edited the _config.yml file by changing the url and baseurl. I also noticed in that file I have mathjax set to true, so I don’t know why I’ve also had problems rendering LaTeX in my blog posts.

Machine learning: $k$-nearest neighbors.

April 25, 2023

I spent the last couple of days working on the application for the TA position at this May’s data science bootcamp. Most of it was very basic questions like writing one-line commands and diagnosing and fixing code with errors. One question tested list comprehension (I still need to do a post on that – maybe tomorrow?). Then near the end was a problem where I had to apply the $k$-nearest neighbors ($k$NN) algorithm to some manufactured data! No machine learning background required, according to the application, since there was documentation for the Python module that would be used. It was challenging (like I said, I spent 2 days on it), but I did learn something.

Fitbit stats project.

April 18, 2023

It’s done, I submitted the project on Friday!

Cleaning JSON data.

April 7, 2023

On Monday I started a new project for Erdős Institute’s data visualization minicourse. For the project I plan to build a dashboard using my Fitbit data for the past year and d3.js, a JavaScript library. It was daunting at first. Every time I see JavaScript I get intimidated, and even when I watched Matt’s d3.js videos I had trouble following what was going on. So I thought by using d3.js in my project I could get more comfortable with JavaScript.

Machine learning: PCA II.

April 3, 2023

Check out my previous post on the idea and math behind PCA.

Machine learning: PCA I.

April 1, 2023

PCA stands for Principal Component Analysis. It’s what’s called an unsupervised learning algorithm and it’s a (clever!) way of reducing the number of dimensions in a data set. As a bonus, it reveals correlations between features.
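To give a feel for the idea, here is a rough sketch in plain Python that finds the first principal direction of a tiny made-up data set using power iteration on the covariance matrix. This is just one way to approximate it and is an assumption on my part, not the method from the posts; in practice a numerical library would do this step.

```python
def first_principal_component(rows, iterations=200):
    """Approximate the top principal direction by power iteration."""
    n, d = len(rows), len(rows[0])
    # Center each feature at zero
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Repeatedly multiply by the covariance matrix and renormalize
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Made-up data where the second feature is exactly twice the first
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction = first_principal_component(points)
print(direction)  # close to [0.447, 0.894], i.e. (1, 2) scaled to length 1
```

Since the two features here are perfectly correlated, a single direction captures all of the variation, so the two-dimensional data could be replaced by one coordinate along that direction.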

More scraping obstacles.

March 30, 2023

Matt told me I really need to get at least 500 posts’ worth of data to proceed with the data exploration so I’ve been at it since Tuesday. So frustrating, I thought I had my scraping code finalized on Saturday and yet it hasn’t worked well enough to get me the complete data set. Here are some of the obstacles:

Preprocessing!

March 26, 2023

Yesterday I moved to the next phase of my project, data cleaning and preprocessing. I really didn’t have to do too much cleaning! Every time I thought of a way to make the data prettier I just incorporated it into the data gathering code.

New project.

March 24, 2023

This past week has been spent working on the new project. Coming up with a new topic seemed daunting, so I took the rest of last Friday off and started looking on Saturday. But I felt like after getting all that experience scraping data on the last project attempt my options for where to get data were more open. I got this idea that I would analyze the posts of Fox News and MSNBC against the number of comments and write an algorithm to predict which topics would get more traffic. Unfortunately MSNBC doesn’t have a comments section. I read that a lot of news sites are getting rid of comments under the rationale that a toxic comments section will discredit the article. I tried to look at other news sites and it seemed to be true. Ultimately I decided I would just get my data from the 538 politics section. It has comments and the posts already have tags so I don’t need to do any advanced NLP to get the topics.

Conclusions from cleaning.

March 17, 2023

This week I learned a lot when trying to clean my project data. I fixed the loop I made with the user inputs. At first it wouldn’t change the names of the columns, even when I added the .rename command. It turns out I needed to include the argument inplace=True, which makes the command modify the data frame itself. I was kind of confused about that, because I thought the default for that argument was True, but it’s actually False: without it, .rename returns a renamed copy and leaves the original data frame alone. I also added a few more print commands to the loop for debugging purposes. I’m pretty proud of the loop, so I’ve included it all, with all the comments.

# Loop that will prompt me to delete or rename each column.
# Assumes (from context) that master_data_cleaned is a pandas DataFrame
# and columns is a list of its original column names, both defined earlier.
for i in reversed(range(len(columns))):
    # Prompt to delete
    print("Column index:", i)
    print("Column name:", columns[i])
    delete_choice = input("Delete column (y or n)?  ")
    # Make sure the input is valid
    while delete_choice not in ("y", "n"):
        delete_choice = input("Enter y or n.  ")
    if delete_choice == "y":
        # Delete if yes
        del master_data_cleaned[columns[i]]
        print()
    else:
        # Prompt to rename if no
        rename_choice = input("Rename the column (y or n)?  ")
        # Make sure the input is valid
        while rename_choice not in ("y", "n"):
            rename_choice = input("Enter y or n.  ")
        # Rename if yes
        if rename_choice == "y":
            new_name = input("Enter the new name: ")
            master_data_cleaned.rename(columns={columns[i]: new_name}, inplace=True)
            # Verify the name was changed
            new_columns = master_data_cleaned.columns.values.tolist()
            master_index = new_columns.index(new_name)
            print("Now new_columns[" + str(master_index) + "] = " + new_columns[master_index] + ".")
        # Print a blank line before moving on, whether or not it was renamed
        print()

Once I got the loop completely working, I started to go through it and edit each of the 198 columns. But as I did, I started to think this really isn’t the most efficient way to clean this data, and maybe there’s a better way. I also noticed that even with the English translations, many of the column titles were still not descriptive enough for me to use in data analysis, so it felt like I was throwing out a lot of data.
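As a side note on the inplace behavior from earlier in this post, here is a small sketch on a made-up frame (not the survey data) showing what .rename does with and without it. It also shows one possible non-interactive alternative to the prompt loop: collecting the choices in a dict and a list, then applying them in one go. The column names and choices are hypothetical.

```python
import pandas as pd

# Made-up frame standing in for master_data_cleaned
df = pd.DataFrame({"q1": [1, 2], "q2": [3, 4], "q3": [5, 6]})

# Without inplace=True, rename returns a new frame and df is unchanged
renamed = df.rename(columns={"q1": "age"})
print(list(df.columns))       # ['q1', 'q2', 'q3']
print(list(renamed.columns))  # ['age', 'q2', 'q3']

# Batch version: one dict of renames and one list of drops (hypothetical)
renames = {"q1": "age", "q3": "country"}
drops = ["q2"]
cleaned = df.drop(columns=drops).rename(columns=renames)
print(list(cleaned.columns))  # ['age', 'country']
```

Keeping the decisions in a dict and a list also leaves a written record of what was renamed or dropped, which the interactive loop doesn't.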

Data cleaning.

March 11, 2023

Today and last night I worked more on the bootcamp project. My first goal was to scrape the table with the column names for the first file from Kaggle. It turned out to be a little more involved, because the table was not made using HTML; it was made using Markdown in a script element. It took a while, but I was able to pull the text from within that script element, then use the split function to create strings separated by the newline character, to get each row of the table. Then I used split again to create two lists, one for each column of the table. Then it didn’t take long to rename the columns in my master data file according to the values in the table.
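The split steps look roughly like the sketch below. The table text here is a made-up stand-in for what came out of the script element, and the variable names are hypothetical, so treat it as an outline of the approach and not the exact code.

```python
# Made-up stand-in for the Markdown text pulled out of the script element
markdown_table = """| Column | Description |
| --- | --- |
| q1 | Age of respondent |
| q2 | Country of residence |"""

codes, descriptions = [], []
# Split on newlines to get rows, skipping the header and separator rows
for row in markdown_table.split("\n")[2:]:
    # Split each row on the pipe character to get its two cells
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    codes.append(cells[0])
    descriptions.append(cells[1])

# Pair the two lists up so they can be used to rename columns
rename_map = dict(zip(codes, descriptions))
print(rename_map)
```

A dict like `rename_map` can then be handed to a data frame's rename method to relabel all of the columns at once.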

Merging two data sets.

March 8, 2023

Today I spent some more time working on the data science boot camp project. I read on the Kaggle page that the hash column does give identifiers for the survey respondents. So I spent time working on merging the data between the two files.

Translating using Python.

March 1, 2023

Today I started working on a project for the Erdős Institute’s data science bootcamp that I attended last fall. The first task is to find a data set, state a question that can be answered using the data, then identify stakeholders and key performance indicators (KPIs). I decided to use a trending data set from Kaggle on the correlation between drug use and mental health during the COVID-19 pandemic. The data is from two surveys conducted during spring and summer of 2020.

Back to the blog.

February 27, 2023

Today I saw a repository in my GitHub that I didn’t recognize and found out it’s a blog I forgot I had created in summer of 2017. How did I forget about that? I read all the posts and realized there is a lot about programming that I knew back then and can’t recall now. The blog lasted only about a month, so I guess I must’ve learned all that stuff that summer. That gives me hope that I can get more proficient with my coding abilities in the next month or so.

Test drive!

July 1, 2017

As an exercise, go back to the 10 June 2017 post and zoom in until the screen capture of my C snippet illustrating an array becomes readable. Assuming the all-caps comments are filled in with correct code, or filling them in yourself, what do you expect will be the result of running this little program? I could’ve sworn I’d already tested this using an online compiler but yesterday I found out I was wrong.

My online presence.

June 25, 2017

I was thinking about a blog post re: more general stuff. The privacy risks of using a browser extension like Honey, and then inviting my Facebook friends to download it, too.

Testing sorting algorithms.

June 18, 2017

Gilbert & Forouzan recommends testing sorting algorithms with four types of arrays:

Structures in C.

June 10, 2017

I am working on a page that lists each of the following data structures:

array list/vector
linked list
queue
stack
hashtable
binary search tree
priority queue/heap

Comments about comments.

June 1, 2017

At some point soon I would like to figure out how to enable comments. It’s something to do with registering with Disqus. Meanwhile, I’m aware some computer files should be stored locally, rather than somewhere like Dropbox or in a Public or All Users folder. I might crowdsource this on Facebook – the question is, which files are appropriate for storing on the cloud, and which files are better stored on an internal hard drive? I learned today, while spending three hours trying to compile the McDowell LaTeX resumé template, that it makes a very big difference whether or not you run MiKTeX as an administrator when updating or installing packages. The issue is that depending on who you update as, Administrator vs. User, the files that get downloaded go to wildly different places on the hard drive. In fact, from what I can tell 90% of compile errors result from the compiler not being able to find the files it needs, because of this issue.

Studying notes.

May 31, 2017

Factorial powers of ten

I'm up and running!

May 30, 2017

It’s a first blog post! I got this blogging page going via an article I found just from browsing around the Github support pages, written by Barry Clark.