eCommerce Product Type Identification

The first step in processing product data is identifying what type of product each record describes, both generally and specifically. Once the product type is known, it becomes possible to automatically classify products, extract attributes, standardize values, infer missing data, analyze quality issues, and enhance the product data.

The difficulties are twofold. First, humans are amazingly inconsistent in how they enter data: typos, omitted words, abbreviations, synonyms, and words whose meaning depends on context (e.g. "hose" means something very different in apparel than it does in auto parts). Second, there is conceptually a blurred line between what a product fundamentally is and the attributes that separate it from other similar product types.

The approach I came up with was to run a word n-gram (phrase) analysis against known product data, combined with word vectors, to establish root product types: phrases that couldn’t be broken down into a smaller phrase without significantly altering their semantic meaning. I then used NLP word roots and word vectors to standardize those product types within an eCommerce product catalog.
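To make the idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the actual implementation: the function names, the 0.8 similarity threshold, the rule of dropping only leading words, and the tiny hand-made vectors are all hypothetical. A real system would use trained word or phrase vectors and a proper stemmer or lemmatizer.

```python
import math
from collections import Counter


def naive_root(word):
    """Crude plural stripping; a stand-in for a real stemmer or lemmatizer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def word_ngrams(title, max_n=3):
    """Yield every word n-gram (phrase) of length 1..max_n in a title."""
    words = [naive_root(w) for w in title.lower().split()]
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def root_product_type(phrase, vectors, threshold=0.8):
    """Drop leading words while the shorter phrase keeps a similar meaning.

    Assumption: modifiers come before the head noun, so only leading
    words are candidates for removal.
    """
    words = [naive_root(w) for w in phrase.lower().split()]
    while len(words) > 1:
        current, shorter = " ".join(words), " ".join(words[1:])
        if current not in vectors or shorter not in vectors:
            break  # no vector available, so keep the phrase as it is
        if cosine(vectors[current], vectors[shorter]) < threshold:
            break  # shortening further would change the meaning too much
        words = words[1:]
    return " ".join(words)


# Made-up sample titles; phrase counts show which n-grams recur in the data.
titles = ["Red Radiator Hose", "radiator hose 2in", "Blue radiator hoses"]
phrase_counts = Counter(p for t in titles for p in word_ngrams(t))

# Toy phrase vectors with invented values, for illustration only.
TOY_VECTORS = {
    "red radiator hose": (0.9, 0.1, 0.2),
    "radiator hose": (0.9, 0.1, 0.0),
    "hose": (0.3, 0.7, 0.0),
}

print(phrase_counts.most_common(3))
print(root_product_type("Red Radiator Hoses", TOY_VECTORS))
```

With these toy values, "Red Radiator Hoses" reduces to "radiator hose": removing "red" barely moves the vector, while removing "radiator" moves it a lot, so the reduction stops there. The same word-root step that normalizes "hoses" to "hose" is what lets differently entered titles map to one standardized product type.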