How to become a Kaggle Competitions Grandmaster
writtencast 001 - Shujun He
In this inaugural interview of the writtencast, I am joined by Shujun He. Shujun is a Ph.D. student at Texas A&M University and a Kaggle Competitions Grandmaster. To become a Kaggle Competitions Grandmaster, you need 5 gold medals, at least one of which must be a solo gold medal. In this conversation, we explore his journey to becoming a Kaggle Competitions Grandmaster and what he has learned from that experience.
Let's get to it.
How did you get started on Kaggle?
I played StarCraft II competitively during college and became very interested in deep learning after DeepMind’s AlphaStar. Later, I came across Kaggle on Google while searching for things related to deep learning. My first competition was “Predicting Molecular Properties” because it was also related to my Ph.D. studies at the time. Everything seemed very complicated and difficult back then; even just making a submission file in the correct format took me a long time. I placed 1294th out of 2739 competitors, so nothing to brag about, but I learned a lot in that competition. Importantly, the top solutions in that competition were very impressive and gave me some useful ideas and inspiration for future competitions.
What does one need to do to become a Kaggle Competitions Grandmaster?
First of all, consistent hard work is a must. Next, one needs to be good at coding and reading and understanding other people’s code. Being able to quickly absorb information from the internet, Kaggle forum, and Kaggle notebooks is also key to performing well. Further, fundamental understanding of all the different algorithms, machine learning or not, is crucial. Good understanding of the data one is trying to model is also important because it will help one select the optimal methods/algorithms to use. Mental fortitude is important because one needs to be able to make good decisions under pressure and also because a solo gold is required for Grandmaster – competing solo can be very stressful. Last but not least, a bit of luck, as consistently placing in the top 1% of 5 different competitions is difficult even for the best of the best.
What role does working in teams play in becoming a Kaggle Competitions Grandmaster?
Working in teams is important because even simply ensembling solutions from different teammates can usually lead to better results. Working with other top competitors is always a pleasure and makes things much easier. However, for beginners, it might be difficult to find good teammates, so it may be better to compete solo and establish oneself first, which is kind of what I did. My first gold medal was a solo gold, so with that, I made a name for myself, which also attracted other competitors to want to team up with me.
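To illustrate the ensembling point, here is a minimal sketch in Python of blending predictions from two teammates by weighted averaging. The data, the `blend` and `rmse` helpers, and the choice of RMSE as the metric are all assumptions made for illustration; they are not taken from any of Shujun's solutions, and the toy numbers are deliberately chosen so that the two models' errors cancel.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def blend(predictions, weights=None):
    """Weighted average of several models' predictions (hypothetical helper).

    predictions: list of prediction lists, one per model, all the same length.
    weights: optional list of non-negative weights; defaults to equal weights.
    """
    n_models = len(predictions)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    if len(weights) != n_models:
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_rows = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]


# Toy data (assumed): one teammate overshoots, the other undershoots.
y_true = [1.0, 2.0, 3.0, 4.0]
teammate_a = [1.5, 2.5, 3.5, 4.5]
teammate_b = [0.5, 1.5, 2.5, 3.5]

blended = blend([teammate_a, teammate_b])

print("teammate A RMSE:", rmse(y_true, teammate_a))
print("teammate B RMSE:", rmse(y_true, teammate_b))
print("blended RMSE:   ", rmse(y_true, blended))
```

In practice the errors of different models rarely cancel this cleanly, but models built independently tend to make partly different mistakes, which is why a simple average often scores better than the individual submissions.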
Which is the most interesting Kaggle Competition that you have worked on?
My first competition, “Predicting Molecular Properties”, because everything was relatively unknown to me at that time, which made it both interesting and a great learning experience.
Which is the most challenging Kaggle Competition that you have worked on, and what did you learn from it?
My most challenging competition is probably “Sartorius - Cell Instance Segmentation”, an object detection competition, because I had to work with a different framework than PyTorch (Detectron2). To make matters worse, I got sick during the last days of the competition. I’m relatively unfamiliar with object detection and started the competition quite late (with about one month left), so I learned that if I want to do something new, I should allocate enough time to learn the domain, and one month is probably not enough for that.
Which skills can one expect to acquire by participating in Kaggle competitions?
There are many skills one can learn by participating in Kaggle competitions. First, one learns to do validation properly and to track the performance of one’s models. Also, since most competitions nowadays are code competitions where one cannot see the test dataset, participating in them teaches one how to handle unseen data. Next, teaming up with other competitors is an integral part of Kaggle, so one can also expect to learn how to work in teams. There are many more skills one can get from Kaggling, but these are the first that come to mind.
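As one concrete picture of what doing validation properly can look like, below is a minimal sketch of k-fold cross-validation in plain Python. The helper names (`kfold_indices`, `cross_val_mae`), the toy targets, the MAE metric, and the mean-predicting baseline are all assumptions for illustration; real competitions often need a split that respects groups or time, which this sketch does not cover.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Return a list of (train_idx, val_idx) pairs (hypothetical helper).

    Every sample lands in exactly one validation fold, so each score is
    computed on data the model did not see during fitting.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps runs comparable
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, val_idx))
    return splits


def cross_val_mae(targets, n_folds=5, seed=0):
    """Per-fold mean absolute error of a baseline that predicts the training mean."""
    fold_scores = []
    for train_idx, val_idx in kfold_indices(len(targets), n_folds, seed):
        # "Fit" on the training fold only: the baseline is just its mean.
        train_mean = sum(targets[i] for i in train_idx) / len(train_idx)
        # Score on the held-out fold.
        mae = sum(abs(targets[i] - train_mean) for i in val_idx) / len(val_idx)
        fold_scores.append(mae)
    return fold_scores


# Toy targets (assumed); a real model and real features would replace the baseline.
targets = [float(v) for v in range(20)]
scores = cross_val_mae(targets, n_folds=5)

print("per-fold MAE:", [round(s, 3) for s in scores])
print("mean CV MAE: ", round(sum(scores) / len(scores), 3))
```

Logging the per-fold scores alongside the mean for every experiment is one simple way to track model performance, and to notice when an improvement on the public leaderboard is not backed by the local validation.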
Kaggle launches a competition that you are interested in. Walk us through the process you take from the start to the end of the competition. Do you have a cheat sheet that you use?
Every competition is different, so there isn’t really a cheat sheet; you just need to adapt to the situation. I usually start by looking at the data and understanding what kind of data it is (image, text, scientific data, etc.). Then I look at what I need to predict and think about the appropriate method to use. The important thing is having a blueprint in my head of what I want to do (e.g., what kind of model I want to build and what external data I want to use). Once I have the blueprint, I just start coding.
You were the 1st in the Google Brain - Ventilator Pressure Prediction Kaggle competition. How did you manage this without a medical background?
Typically I would say domain knowledge is not really required, but sometimes if you have some special insight from domain knowledge, it could be quite helpful. Therefore, I didn’t need a medical background as long as I understood the problem. I managed to win because I had very strong teammates, and we worked very well together.
Tell us about the type of hardware you currently use in Kaggle Competitions.
Currently, I have a workstation with 2x3090, although sometimes I also rent or find other resources. For instance, I recently rented an 8x3090 instance on vast.ai for the Google AI4Code competition.
Can one participate in Kaggle Competitions if they don’t have massive GPU hardware? If so, how?
Yes, they can. People should look for competitions in more niche areas other than NLP/CV; unfortunately, NLP/CV requires more and more computing resources. For instance, to get my solo gold medal in OpenVaccine, which was about prediction of mRNA degradation, I used my old gaming PC with 2x2080, and I believe you could have gotten away with just using Colab and Kaggle resources in that competition. There are always competitions just like that and you just have to pick your battles.
Why did you decide to enroll for a Ph.D.?
I did research during my undergraduate studies and thought I was good at it, so it seemed natural to do a Ph.D. afterward.
How are you using deep learning in your Ph.D. studies?
I have used deep learning to study mRNA degradation and DNA informatics. Currently, my Ph.D. research focuses on stabilisation of mRNA vaccines with deep reinforcement learning.
If you were starting in data science and machine learning today, what would your learning process look like? Which skills would you start with, and where would you find the resources?
This is a tough one. Data science and machine learning are very multi-faceted, and there are many things you need to learn in order to be successful. The most important skill is definitely coding. You need to be able to materialize your idea by coding it, so I would start by learning to code. Math is also important, particularly linear algebra, so I would learn math and make sure I understand linear algebra. As for where to find resources, I think it suffices to just look up whatever you need on Google and try to find whatever information you need.
You moved from China to the US to study. How was that process for you?
It was a pretty smooth transition. People in the US are generally easy to get along with, and I was able to adapt to college in the US very quickly.
You have written some papers. How important is writing in building a career in data science and machine learning?
Writing is important if you want to express your ideas and present your work. I would say it is definitely something you want to work on if you’re not good at it.
In your opinion, which are the most underrated skills in data science and machine learning?
The most underrated skills are the fundamentals (i.e. math and coding).
What are your plans after you have completed your Ph.D.?
I would like to continue to do research in some capacity.
Where can people find you online?
My Kaggle profile. You can also find me on LinkedIn if you want to get in touch with me.