What tools do top Kaggle participants use?

Our team in the top 2% at Kaggle


Whether for books, clothes or garden tools: in the digital age we increasingly shop on online platforms such as Amazon, Otto.de or Zalando. But who hasn't been surprised by the results a platform returns for a search term?

Home Depot, an American hardware store chain that has also sold its products online since September 2005, launched a data science competition on kaggle.com to help optimize its product search algorithm. Specifically, the now-completed contest was about predicting how relevant products are to given search terms. Participants received a training set of over 70,000 pairs of search terms and product titles (including the associated product descriptions). Each pair was assigned a value between 1 and 3, depending on how relevant the product is to the search query: a value of 1 indicates low relevance, a value of 3 high relevance. Using this training set, participants were asked to predict the relevance of a further 165,000 pairs of search terms and product titles.

Fig. 1: Product articles for the search term "AA battery" with their respective relevance.

rheindata GmbH also took part in this competition and finished 35th out of a total of 2,125 participants.
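For illustration, the training data can be thought of as triples of search term, product title and relevance. The rows below are hypothetical examples in the spirit of Fig. 1, not actual competition data:

```python
# Hypothetical rows illustrating the training-set layout (not actual competition data):
# (search term, product title, relevance between 1 = low and 3 = high)
train = [
    ("aa battery", "AA Alkaline Battery (8-Pack)", 3.0),
    ("aa battery", "9V Alkaline Battery", 2.0),
    ("aa battery", "Garden Hose Nozzle", 1.0),
]

# The task: given such labelled pairs, predict the relevance of unseen pairs.
assert all(1.0 <= relevance <= 3.0 for _, _, relevance in train)
```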

Our approach

Our approach is based on three components: data cleansing, feature engineering, and model building with machine learning algorithms. As our analysis tool we used Python, which offers excellent modules for statistical data analysis, text mining and machine learning.

Data cleansing

After correcting spelling mistakes and removing stop words and punctuation, we determined the stem of each word in the search term, the product title and the product description. We then identified relevant synonyms for the words in the search term and computed the bi- and trigrams (sequences of two or three tokens) of the search terms, product titles and product descriptions. Finally, we assigned each word a so-called tf-idf weight, which weights frequently occurring words less strongly than rare ones. In this part of the analysis we made intensive use of the data analysis module Pandas and the text mining module NLTK (Natural Language Toolkit).

Fig. 2: Data cleansing for the search term "vaccuum cleaners for hardwood and carpet".
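We used Pandas and NLTK for this step; as a minimal self-contained sketch, the cleaning of a search term might look like the following. The tiny stopword list and the deliberately crude suffix stemmer are stand-ins for NLTK's resources, and spelling correction is omitted here:

```python
import re

# Tiny illustrative stopword list -- a stand-in for NLTK's stopword corpus.
STOPWORDS = {"for", "and", "the", "a", "of", "in", "on", "to"}

def clean(text):
    """Lowercase, strip punctuation, drop stopwords, crude suffix stemming."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    words = [w for w in words if w not in STOPWORDS]
    stemmed = []
    for w in words:
        # Very crude stand-in for NLTK's stemmers (e.g. the Porter stemmer).
        for suffix in ("ers", "ing", "es", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]
                break
        stemmed.append(w)
    return stemmed

def ngrams(tokens, n):
    """All n-grams (sequences of n tokens) of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = clean("vaccuum cleaners for hardwood and carpet")
# Note: the misspelled "vaccuum" survives because spelling correction is omitted here.
```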

Feature engineering

In addition to the obvious features, such as the number of words in the search term and product title, the tf-idf value of the search term, and the number of common words and n-grams in the search term and product title, we formed various ratios of these quantities. We also used typical text mining measures such as the Jaccard and Dice coefficients and the cosine similarity. Furthermore, we clustered the search terms and product titles using their bi- and trigrams, so that each search term or article was assigned to one of 50 clusters. Finally, we discretized all features: continuous features were typically divided into ten bins, from which we then generated statistical features such as the mean and various quantiles per bin. In the end we had created over 2,000 features, which formed the basis for model building with machine learning algorithms.

Fig. 3: Feature engineering for a specific search term / product title pair.
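The text-similarity features mentioned above have standard definitions; a minimal sketch for a single (hypothetical) query/title token pair could look like this:

```python
import math
from collections import Counter

def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| over token sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    """Dice coefficient: 2 |A ∩ B| / (|A| + |B|) over token sets."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def cosine(a, b):
    """Cosine similarity of the token-count vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical cleaned tokens for one search term / product title pair:
query = ["aa", "battery"]
title = ["aa", "battery", "pack", "8"]
features = {
    "n_query_words": len(query),
    "n_common_words": len(set(query) & set(title)),
    "jaccard": jaccard(query, title),
    "dice": dice(query, title),
    "cosine": cosine(query, title),
}
```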

Model building

In the last step, we used the training data with the generated features to train models with various machine learning algorithms. To evaluate the quality of the different models and to avoid overfitting, we used only 80% of the training data for training and kept the remaining 20% for validating the models. After tuning the parameters of the respective machine learning algorithms, our best single model was based on the gradient boosting library XGBoost, with which many Kaggle competitions have been won. Our final model, which earned us 35th place, is a combination of several individual models built with different machine learning algorithms and different feature selections; the final relevance prediction is a weighted average of the individual models' predictions.

Fig. 4: Flowchart of our data analysis.

If data science is an interesting topic for you too, please do not hesitate to contact us.
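The 80/20 holdout validation and the weighted blending of models described above can be sketched in plain Python; the predictions and weights below are purely illustrative stand-ins, not our actual models:

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted relevances."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def train_test_split(rows, test_frac=0.2, seed=0):
    """Shuffle and split into 80% training / 20% validation rows."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

def blend(predictions, weights):
    """Weighted average of several models' relevance predictions."""
    total = sum(weights)
    return [sum(w * p[i] for w, p in zip(weights, predictions)) / total
            for i in range(len(predictions[0]))]

# Hypothetical holdout predictions from two single models
# (e.g. a gradient-boosting model and a model on a different feature subset):
model_a = [2.8, 1.2, 2.1]
model_b = [2.6, 1.4, 1.9]
blended = blend([model_a, model_b], weights=[0.7, 0.3])
```

In practice the blend weights would be chosen by minimizing the RMSE of the blended predictions on the holdout set.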
Copyright © rheindata 2021
