What tools do top Kaggle participants use?
Our team in the top 2% at Kaggle
Home Depot, an American hardware store chain that has also sold its products online since September 2005, launched a data science competition on kaggle.com to help optimize its product search algorithm. Specifically, the contest, which has now ended, was about predicting how relevant products are to given search terms. Participants received a training set of over 70,000 pairs of search terms and product titles (together with the associated product descriptions). Each pair was assigned a value between 1 and 3 indicating how relevant the product is to the search query: 1 stands for low relevance, 3 for high relevance. Using this training set, participants were asked to predict the relevance of another 165,000 pairs of search terms and products.
Fig. 1: Products for the search term "AA battery" with their respective relevance.
Rheindata GmbH also took part in this competition and finished in 35th place out of a total of 2,125 participants.
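For readers who want to reproduce the setup, the following sketch shows how such data could be loaded and joined with pandas. The file and column names (train.csv, test.csv, product_descriptions.csv, product_uid, relevance) are assumptions based on the usual Kaggle layout, not an excerpt of our actual code.

```python
import pandas as pd

# File and column names are assumptions based on the usual Kaggle layout.
train = pd.read_csv("train.csv", encoding="ISO-8859-1")    # search term / product pairs with relevance 1..3
test = pd.read_csv("test.csv", encoding="ISO-8859-1")      # pairs whose relevance is to be predicted
descriptions = pd.read_csv("product_descriptions.csv")     # product_uid -> product_description

# Attach the product description to every search term / product pair
train = train.merge(descriptions, on="product_uid", how="left")
test = test.merge(descriptions, on="product_uid", how="left")

print(train[["search_term", "product_title", "relevance"]].head())
```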
Our approach
Our approach is based on three components: data cleansing, feature engineering and model building with machine learning algorithms. We used Python for the analysis, as it offers excellent modules for statistical data analysis, text mining and machine learning.
Data cleansing
After correcting spelling mistakes and removing stop words and punctuation, we determined the stem of each word in the search term, the product title and the product description. Furthermore, we identified relevant synonyms for the words in the search term and built the bigrams and trigrams (sequences of two or three tokens) for the search terms, product titles and product descriptions. Finally, we assigned a so-called tf-idf weight to each word, which gives frequently occurring words a lower weight than rarely occurring words. In this part of the analysis we made heavy use of the data analysis module Pandas and the text mining module NLTK (Natural Language Toolkit).
Fig. 2: Data cleansing for the search term "vaccuum cleaners for hardwood and carpet".
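The sketch below illustrates these cleansing steps with NLTK and scikit-learn: stop word removal, stemming, n-gram construction and a tf-idf weighting. Spelling correction and synonym lookup are omitted, and the helper names and example texts are made up for the illustration, so this shows the idea rather than our actual pipeline.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def clean(text):
    """Lowercase, strip punctuation, drop stop words and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

# Example search term (spelling correction is not shown here)
query = clean("vaccuum cleaners for hardwood and carpet")
query_bigrams = list(ngrams(query, 2))
query_trigrams = list(ngrams(query, 3))

# Tf-idf weights per stemmed word over a corpus of cleaned titles/descriptions
corpus = [clean("hardwood floor vacuum cleaner"), clean("AA alkaline battery 24 pack")]
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)   # tokens are already prepared
tfidf_matrix = tfidf.fit_transform(corpus)
```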
Feature engineering
In addition to the obvious features, such as the number of words in the search term and product title, the tf-idf value of the search term, and the number of common words and n-grams in the search term and product title, we formed various ratios of these quantities. We also used typical text mining measures such as the Jaccard and Dice coefficients and the cosine similarity. Furthermore, we clustered the search terms and product titles based on their bigrams and trigrams, so that each search term and product was assigned to one of 50 clusters. We also discretized all features: continuous features were typically divided into ten blocks, and for each block we generated statistical features such as the mean and various quantiles. In the end, we had created over 2,000 features that formed the basis for model building with machine learning algorithms.
Fig. 3: Feature engineering for a specific search term / product pair.
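As an illustration, the following snippet computes a few of the overlap features mentioned above (Jaccard and Dice coefficients, word-count ratios) and shows the discretization into ten blocks with per-block statistics. The function names, example tokens and toy data are made up for this sketch and do not reproduce the full feature set.

```python
import pandas as pd

def jaccard(a, b):
    """Jaccard coefficient of two token sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def dice(a, b):
    """Dice coefficient of two token sets."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

# Overlap between a cleaned search term and a cleaned product title
query = ["vacuum", "cleaner", "hardwood", "carpet"]
title = ["hardwood", "floor", "vacuum", "cleaner"]
features = {
    "n_query_words": len(query),
    "n_common_words": len(set(query) & set(title)),
    "common_word_ratio": len(set(query) & set(title)) / len(query),
    "jaccard": jaccard(query, title),
    "dice": dice(query, title),
}

# Discretization into ten blocks and per-block statistics (toy data)
df = pd.DataFrame({"jaccard": [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]})
df["jaccard_block"] = pd.qcut(df["jaccard"], 10, labels=False, duplicates="drop")
block_stats = df.groupby("jaccard_block")["jaccard"].agg(["mean", "median"])
```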
Model building
In the last step, we used the training data with the generated features to build and train models with various machine learning algorithms. To evaluate the quality of the different models and to avoid overfitting, we used only 80% of the training data for training and the remaining 20% for validating the models. After tuning the parameters of the respective machine learning algorithms, our best single model was based on the gradient boosting library XGBoost, with which many Kaggle competitions have been won. Our final model, which brought us 35th place, is an ensemble of several individual models built with different machine learning algorithms and different feature selections; the predicted relevance is a weighted average of the individual models' predictions.
Fig. 4: Flowchart of our data analysis.
If data science is an interesting topic for you too, please do not hesitate to contact us.
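A minimal sketch of such a training and validation setup: an 80/20 split, an XGBoost model alongside a second learner, and a weighted average of their predictions. The feature matrix, hyperparameters and ensemble weights shown here are placeholders, not the values we actually used.

```python
import numpy as np
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and relevance scores standing in for the
# engineered features of the previous step
rng = np.random.default_rng(0)
X = rng.random((1000, 20))
y = rng.uniform(1, 3, 1000)

# 80% of the training data for training, 20% for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Two individual models; the hyperparameters here are illustrative, not tuned
xgb_model = xgb.XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
rf_model = RandomForestRegressor(n_estimators=300, n_jobs=-1)
xgb_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)

# Weighted average of the individual predictions (weights are placeholders)
pred = 0.7 * xgb_model.predict(X_val) + 0.3 * rf_model.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, pred))
print(f"Validation RMSE: {rmse:.4f}")
```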