Information
This case study showcases the expertise of ProCogia’s Senior Data Analytics Consultant, Jens Sommerfeld, in optimizing data analysis with Redshift to support counterfeit detection and legal compliance for an online retailer.
The Challenge
One of the largest online retailers received a legal request to identify all customers in a specific state in the USA who might have received counterfeit items. These items belonged to the same shipping batch as those that had received a counterfeit complaint. (Some of this information is purposely vague due to existing NDAs.)
Technical Implementation
Identify all retail counterfeit items of the same Vendor shipment that were sold to the same state. Count the orders, items, sold units and item cost over a timeframe of 4 years.
Identify all retail vendors and ASINs (unique item IDs) that received a counterfeit complaint in the past four years.
During that timeframe, complaints were manually kept in an Excel sheet only. Deduplicate all vendor & ASIN combinations and create a static table to join with other data tables.
A prior analyst had created a bare-bones query. Upon review, the query was using incorrect tables. When an order is placed, an item gets assigned for accounting and inventory management purposes. However, the warehouse may pick the same item from a bin that does not necessarily belong to the assigned shipment or vendor. Therefore, the query had to be completely rewritten. Legal was informed that the previous rough estimates might not be correct.
Our systems were not connected to the inventory receiving system, making it impossible to track all items from the same shipment. I met with the inventory analyst, and together we worked out the average number of days it takes to sell out a batch by vendor to continue the analysis. The average was 60 days, but it was still impossible to determine whether the counterfeit complaint was received at the beginning of the shipment or after all items from that shipment had already been sold.
Our Redshift cluster did not store data that far back, so the data had to be queried directly from the Data Lake. However, the available tool was not designed to handle such complex queries directly from the Data Lake. I had to aggregate the data manually and load it into Redshift to continue the analysis.
Since I did not know the inventory batch volume or the exact timeline of when each item from a specific batch was sold, I presented our legal team with the following three approaches:
i. All items shipped to that state 60 days prior to the complaint date.
ii. All items shipped to that state within a dynamic date range: 30 days before and 30 days after the complaint date.
iii. All items shipped to that state 90 days before vendor enforcement.
Another hurdle was drop shipments, which included all items sent directly from the vendor to the customer. These types of shipments receive different vendor codes, which had to be manually reviewed and added to the static table created with an additional column differentiating regular orders from drop shipments.
Finalize the query, summarize the output, and send it to legal along with all pros and cons of each approach. Additionally, document all steps taken to complete this analysis in case it needs to be replicated.
Results
The details were shared with legal and the District Attorney of that state. I was not privy to any further information regarding the outcome or the legal team’s next steps. However, the legal team was very satisfied with my approach and process, so I was asked to train the IP Infringement Team on how to replicate my steps and produce a similar output.