How to win Kaggle Competitions
writtencast 003 — Tomohiro Takesako a.k.a Tom — Kaggle Competitions Grandmaster
Tomohiro Takesako is one of 263 Kaggle Competitions Grandmasters. He has participated in over 40 competitions on Kaggle. In this interview, he tells us how he got started and the process he uses when competing on Kaggle. He also reveals what you need to do to earn the coveted title of Kaggle Competitions Grandmaster.
How did you get started on Kaggle?
I started with the “Predict Future Sales” competition, which is the final project for the Coursera course “How to win a data science competition”. I had started learning ML with Andrew Ng’s great Coursera ML course and then looked for more advanced lectures. That project was really tough for me, but I learned a lot. After that project, I joined the “Santander Value Prediction Challenge”, which turned out not to be a standard competition in the end, but was memorable for me as it was my first competition with ranking points and tiers.
What does one need to do to become a Kaggle Competitions Grandmaster?
Keep joining as many competitions as possible. You need at least 5 gold medals :). I usually join competitions that look interesting from some point of view (Is the task challenging? Is the field new to me? Is it a meaningful task?), which is probably good for staying motivated.
Try not to give up easily. Sometimes a few more experiments yield a good insight.
What role does working in teams play in becoming a Kaggle Competitions Grandmaster?
We can learn new things (techniques, how to work on a competition, etc.) from teammates. These are often helpful for getting more gold medals. Also, working with teammates might give us the motivation to see a competition through. If you have strong teammates, you’re lucky, since you will have more chances to be in the gold zone :).
I like to play solo, too. When I am solo, I make all of the decisions during the competition, which is a good learning experience that differs from team play.
Which is the most interesting Kaggle Competition that you have worked on?
“Generative Dog Images” :). This was the first GAN competition on Kaggle, where we needed to train a model from scratch within a Kaggle notebook (9-hour limit)! Getting good output from a GAN was really interesting. I even continued to generate fake dogs after the competition.
Which is the most challenging Kaggle Competition that you have worked on and what did you learn from it?
“Human Protein Atlas — Single Cell Classification”. This was a competition with a weak label setting. We were given labels per image but we needed to predict labels of instances inside images. So we didn’t have any direct labels for the target instances. I learned that creative data augmentation was really important, which I missed.
I would also pick “VSB Fault Detection”, which was really tough because the data was noisy and the CV-LB correlation was not good (for most teams, I think). Our team dropped around 500 places on the private LB from 14th place on the public LB. It was my first ever big shakedown. I trusted my CV but failed. I learned that even if CV and LB look correlated, there is still a chance of missing something important.
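As a rough illustration of what checking the CV-LB correlation can look like, here is a minimal sketch that computes the Pearson correlation between local CV scores and public LB scores across a few submissions. The scores and the `pearson` helper are made up for illustration; they are not from Tom's competitions or tooling.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical experiment log: (local CV score, public LB score) per submission.
experiments = [(0.61, 0.60), (0.64, 0.62), (0.66, 0.65), (0.69, 0.66), (0.71, 0.70)]
cv_scores = [cv for cv, _ in experiments]
lb_scores = [lb for _, lb in experiments]

r = pearson(cv_scores, lb_scores)
print(f"CV-LB Pearson r = {r:.3f}")
```

A high value here only says that CV and the public LB moved together on the submissions logged so far. As the shakedown story above shows, it does not guarantee the same behavior on the private LB.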
Which skills can one expect to acquire by participating in Kaggle Competitions?
I think you can acquire lots of skills: coding skills, the skill to run experiments more efficiently and rapidly, the skill to use state-of-the-art ML/DL libraries, the skill to research related papers, and so on. But you will need to spend tons of time to acquire them.
Kaggle launches a competition that you are interested in participating in. Walk us through the process you take from the start to the end of the competition. Do you have a cheat sheet that you use?
- Read the competition’s top pages (overview, data, discussion, rules), then click the join button.
- Create a new notebook, look at the data, and do some EDA.
- Download the dataset and create a baseline model. Then submit it to check the CV-LB correlation.
- Try to get a decent CV-LB with a small model. Once I’m satisfied with it, move to a larger one.
- Apply standard methods, including ensembling, to find the most important factor in the competition.
- After trying all of my own ideas, read the discussions and check the top solutions from past competitions. Try what I find interesting and check the performance.
- If I’m stuck, do some error analysis to find a good insight.
- In the last stage of the competition, use more models and bigger ensembles if needed.
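The baseline and ensemble steps in the list above might be sketched as follows. This is a minimal illustration, not Tom's actual pipeline: the data is synthetic, the model is a closed-form ridge regression chosen only to keep the example dependency-light, and helper names such as `oof_predictions` are hypothetical.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and split them into n_folds validation folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression weights (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def oof_predictions(X, y, alpha, n_folds=5):
    """Out-of-fold predictions: each row is predicted by a model that never saw it."""
    oof = np.zeros(len(y))
    for val_idx in kfold_indices(len(y), n_folds):
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        weights = fit_ridge(X[train_mask], y[train_mask], alpha)
        oof[val_idx] = X[val_idx] @ weights
    return oof

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-in for competition data (an assumption for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Two baselines evaluated on the same folds, plus the simplest ensemble: an average.
oof_a = oof_predictions(X, y, alpha=0.1)
oof_b = oof_predictions(X, y, alpha=10.0)
oof_blend = (oof_a + oof_b) / 2

scores = {
    "ridge alpha=0.1": rmse(y, oof_a),
    "ridge alpha=10": rmse(y, oof_b),
    "blend": rmse(y, oof_blend),
}
for name, score in scores.items():
    print(f"{name}: CV RMSE = {score:.4f}")
```

Scoring every model and blend on the same out-of-fold predictions gives one local CV number per idea, which can then be compared against the leaderboard score after submitting.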
Tell us about the type of hardware you currently use in Kaggle Competitions.
- CPU : Intel Core i7-8700K
- GPU : NVIDIA Titan RTX
- SSD : 2TB
- Memory : 64GB
Can one participate in Kaggle Competitions if they don’t have massive GPU hardware? If so, how?
It depends on the competition you join. But good hardware is helpful for you to focus on the competition’s task itself.
Why did you decide to enroll for a Ph.D.?
I wanted to do research in physics, so it was natural for me to enter a Ph.D. program.
If you were starting in data science and machine learning today, what would your learning process look like? Which skills would you start with and where would you find the resources?
I would start with some online courses. Andrew Ng’s ML lectures on Coursera would be my starting point. I would also read textbooks about ML. As for DL, I would start with the official PyTorch tutorial and Kaggle courses.
You have written some papers. How important is writing in building a career in data science and machine learning?
I think it is not directly helpful for my data science career because my papers are mainly about dark matter models in particle physics :). But that experience helps me search for papers and understand the context of a field.
In your opinion, which are the most underrated skills in data science and machine learning?
Not sure, but in Kaggle, I think creating a solid baseline rapidly is one of the important skills that ends up not being rated (if you don’t make it public).
People have said that Kaggle competitions don’t mirror real-world problems, partly because the data is sometimes already cleaned and because there is a leaderboard to test your solutions on. What is your response to this claim?
I don’t care about that, since Kaggle focuses on creating the best-performing models and that is meaningful in itself. You can’t win only by tuning model parameters. Also, sometimes we have almost real-world data in Kaggle. Anyway, I recommend that such people run through at least 10 competitions. It’s fun :).
How do you apply machine learning in claim automation and fraud qualification at Spout.ai?
We’re using OCR and NLP techniques, so DL models are central to them. This is challenging.
Apart from mainstream machine learning packages such as TensorFlow and PyTorch, which other little-known tools do you use in your work?
I think I use mainstream packages only.
Where can people find you online?
- My Kaggle account.
- My LinkedIn account.