“Intro to machine learning” Kaggle Learn course: from zero to beginner
How the practical skills taught by the introductory course from Kaggle can bring new data scientists to the Kaggle community and market.
In 2012, Harvard Business Review dubbed data scientist the sexiest job of the 21st century, a label that stuck for a long time. Seven years later, much of that allure has faded as the role matured and consolidated in the market. Even so, according to LinkedIn’s list of most promising jobs, data scientist still leads in job openings and in Career Advancement Score, both based on LinkedIn’s data.
There are plenty of places to acquire these still-growing job skills, e.g. Coursera, Udacity, and DataCamp. However, I will focus on one of Kaggle Learn’s introductory courses: Intro to Machine Learning by Dan Becker.
What is Kaggle
Kaggle was born as a place to host machine learning competitions, but it ended up becoming the world’s biggest “data science home”, and it is filled with resources that make it a great place to learn. The site has a supportive community that discusses competitions, answers questions, and gives great feedback on your solutions. Beyond the thousands of datasets to explore for insights and to test your modeling skills, the large number of competitions is a great way to learn new techniques or test your own. Last but not least, Kernels are one of the main things that make Kaggle a nice place to learn. Kernels are virtual environments running on Kaggle’s servers where you can run Python/R/Julia scripts or Jupyter Notebooks for free. They are a great resource because you don’t need to worry about setting up a local environment, which makes it easier for beginners to learn data science.
Given that environment, and with a focus on offering practical courses free of charge, Kaggle Learn was born. The courses are a series of selected Kernels written by the community’s own users, and they are a great kick-off to go from zero to beginner and start your data science Hero’s journey.
Intro to Machine Learning
“Intro to Machine Learning” was the first course I took on Kaggle. As a new computer science student, I knew only basic Python syntax and basic programming logic. But, as the course itself says, the only prerequisite is basic Python knowledge, which makes it a great gateway for beginners to enter the data science world and then keep studying new topics.
The course takes the form of Jupyter Notebooks organized as a series of Kernels, which enables some interesting features:
- Interactive content: a Jupyter Notebook lets the author write both markdown and code. This is useful because you can read the theory and do coding exercises about it on the same page. It is also easily editable by anyone, which feeds the Kaggle Learn idea of practical, “hands-on” courses made by the community itself.
- No setup: the Kernel runs on Kaggle’s servers, so no local environment is needed. It’s a big help for any beginner, since you don’t need to figure out how to set anything up on your PC: just start and go!
- Forking: one of the most interesting features of Kernels is forking. You can copy anyone’s Kernel and continue their work on your own branch. It’s a great way to test your ideas and start to understand how to work collaboratively in data science. Quoting Will Koehrsen: “data scientists stand not on the shoulders of giants, but on the backs of thousands of individuals who have made their work public for the benefit of all.”
Modules
There are twelve modules in this course. They focus on quickly building a basic understanding of some introductory concepts and on getting the student started in competitions. The main ones are:
- How Models Work: This module introduces the concept of a model and how we use models to predict targets or analyze data. The first model presented is a simple decision tree. It was chosen for its simplicity, because it is easy to visualize, and because it is a building block for fancier models.
- Basic Data Exploration: The second and third modules pair theory and exercises on exploring your dataset with the Python library Pandas. They show how to read a CSV file into a Pandas DataFrame and how to analyze it with the describe function. As an exercise, you are asked to read a DataFrame and produce some insights about it. There is a dedicated Pandas course on Kaggle for learning more complex data exploration.
- My First Machine Learning Model: This module introduces the main steps for building and using a machine learning model: define, fit, predict, evaluate. Before that, it shows how to select your prediction target and features from the data. In the exercises, you are asked to predict house sale prices using the Scikit-learn library and the “Melbourne house prices” dataset, evaluating the model with mean absolute error (MAE).
- Model Validation: This more theoretical module explains some fundamental concepts: what model evaluation is, why you should evaluate your models, what the “in-sample” score problem is (a score computed on the training data itself looks better than the model really is), and how to fix it. In the exercises, you are asked to implement a train-test split and evaluate your model with MAE.
- Underfitting and Overfitting: Another fairly theoretical module. It introduces the concepts of overfitting and underfitting and how they relate to a model’s scores. It shows how to use max_leaf_nodes to try to reduce these effects. In the exercises, you are asked to improve the previous model using these new concepts.
- Random Forests: The 10th and 11th modules cover how to use a slightly more complex model to deal with overfitting and underfitting. They explain why Random Forests lead to better scores, and then you are asked to implement your own Random Forest model in place of your decision tree.
- Machine Learning Competitions: In the last module, you are asked to submit your solution to the House Prices: Advanced Regression Techniques competition. The idea is to look for more features in the dataset to improve your Random Forest score. The author also suggests coming back after you’ve done other micro-courses to keep improving your score and to see your progress.
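The workflow these modules walk through can be sketched roughly as follows. This is a minimal illustration, not the course's own code: the table is synthetic, and the column names (Rooms, Landsize, YearBuilt, Price), the random seed, and the max_leaf_nodes candidates are all assumptions made for the example. In a Kaggle Kernel you would load the course dataset with pd.read_csv instead of generating data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a housing table (hypothetical columns, not the course data).
rng = np.random.default_rng(0)
n = 500
homes = pd.DataFrame({
    "Rooms": rng.integers(1, 7, size=n),
    "Landsize": rng.uniform(100, 1000, size=n),
    "YearBuilt": rng.integers(1900, 2020, size=n),
})
homes["Price"] = (
    50_000 * homes["Rooms"]
    + 300 * homes["Landsize"]
    + rng.normal(0, 20_000, size=n)
)

# Basic data exploration: count, mean, std, min, quartiles, max per column.
summary = homes.describe()
print(summary)

# Select the prediction target (y) and the features (X).
y = homes["Price"]
X = homes[["Rooms", "Landsize", "YearBuilt"]]

# Hold out validation data so the score is not computed "in-sample".
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Define, fit, predict, evaluate with a plain decision tree.
tree = DecisionTreeRegressor(random_state=0)
tree.fit(train_X, train_y)
in_sample_mae = mean_absolute_error(train_y, tree.predict(train_X))
val_mae = mean_absolute_error(val_y, tree.predict(val_X))
print("Tree in-sample MAE:", in_sample_mae)
print("Tree validation MAE:", val_mae)


def tree_mae(max_leaf_nodes):
    """Validation MAE of a decision tree limited to max_leaf_nodes leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Compare tree sizes: too few leaves tends to underfit, too many to overfit.
leaf_maes = {k: tree_mae(k) for k in (5, 50, 500)}
print("Validation MAE by max_leaf_nodes:", leaf_maes)

# Swap the single tree for a Random Forest with default settings.
forest = RandomForestRegressor(random_state=0)
forest.fit(train_X, train_y)
forest_mae = mean_absolute_error(val_y, forest.predict(val_X))
print("Forest validation MAE:", forest_mae)
```

Comparing the in-sample and validation MAE of the unrestricted tree shows why the held-out split matters: the tree can memorize its training rows, so only the validation number says anything about new data.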
Hero’s Journey
Kaggle has a large number of micro-courses. The “Intro to Machine Learning” course mainly unlocks the Intermediate Machine Learning and Deep Learning micro-courses. The intermediate course covers more data science fundamentals, such as how to deal with missing values and non-numerical features, cross-validation, and gradient boosting. The deep learning course introduces tensor programming with TensorFlow; computer vision, transfer learning, and data augmentation are some of the topics covered.
Beyond the machine learning micro-courses, Kaggle has rich content on other subjects. Courses like Pandas and Data Visualization are essential for any data science student and, like the machine learning courses, they focus on fast, practical learning.
Following the Kaggle trail is a great kick-off for studying more complex algorithms and techniques and for starting to build fancier, more competitive models.
Conclusion
Kaggle feels a lot like home. Its supportive community, almost collaborative competitions, and interactive environment make an amazing playground for any data science enthusiast to learn and have fun. Kaggle Learn serves as a gateway for anyone who wishes to start learning about this world.
It’s not about becoming an expert in a few hours. Their practicality makes the micro-courses faster and more fun than many others, but they don’t have the theoretical depth needed to master the subject. It’s more about showing how fun and achievable machine learning, deep learning, and data exploration can be. Kaggle Learn’s introductory courses are therefore a tool for bringing new competitors, readers, discussers, writers, teachers, students, and researchers to this marvelous area.