Product data scraping

How Bambuser Live Shopping product data scraping works

This document describes in detail how Bambuser Live Shopping product data scraping works.

Test our product data scraper

If you do not have access to our dashboard, you can still test how our product scraper will behave with your product URLs. Visit this page and enter your product URL to test.

When you add a product to a show in the Bambuser Live Shopping Dashboard, some basic product details are scraped from the content of the given product URL (e.g. https://yourcompany.com/products/pink-shirt).

The following properties will be extracted from the page:

  • a product name or title
  • an image URL (product thumbnail)
  • a brand name
  • a reference (often called a SKU)

All of these fields can be fetched automatically or entered and modified manually by the admin when setting up the show. The Bambuser product scraper is meant to reduce this manual work by automating product data entry.

How does the product scraper work?

The scraper looks for different kinds of structured product data and metadata, using the following priority order:

  1. Schema.org markup
    1. JSON-LD
    2. Microdata
  2. OpenGraph meta-tags (og:)
  3. Generic HTML tags
note

If the scraper is not able to find a product reference (SKU) it will use the provided product URL as a reference.
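
The priority order and the URL fallback can be sketched in Python as follows. This is a minimal illustration, not Bambuser's implementation: the field names and the `merge_by_priority` helper are hypothetical, and it assumes the scraper falls back field by field, which this document does not confirm.

```python
from typing import Optional

# Illustrative field names, not Bambuser's internal schema.
FIELDS = ("title", "image", "brand", "sku")


def merge_by_priority(url: str, *sources: Optional[dict]) -> dict:
    """Merge candidate product data, with earlier sources taking precedence.

    `sources` is ordered from highest to lowest priority, e.g.
    JSON-LD, Microdata, OpenGraph, generic HTML tags.
    """
    product = {}
    for field in FIELDS:
        for source in sources:
            value = (source or {}).get(field)
            if value:  # first non-empty value wins
                product[field] = value
                break
    # Documented fallback: with no SKU found, the product URL is the reference.
    product.setdefault("sku", url)
    return product


json_ld = {"title": "My Product Name", "brand": "My Brand Name"}
open_graph = {"title": "OG title", "image": "https://yoursite.com/path-to-image.jpg"}
generic = {"title": "My Product Page Name"}

result = merge_by_priority(
    "https://yourcompany.com/products/pink-shirt", json_ld, None, open_graph, generic
)
print(result)
```

Here the title and brand come from the JSON-LD data, the image from OpenGraph, and the reference falls back to the product URL because no source supplied a SKU.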

1. Schema.org markup

Specification: https://schema.org/Product

Google's testing tool can be used to see if your site supports this: https://search.google.com/structured-data/testing-tool/u/0/

Example:

<script type="application/ld+json">
{
"@type": "Product",
"@context": "http://schema.org/",
"name": "My Product Name",
"description": "My Description",
"brand": { "@type": "Thing", "name": "My Brand Name" },
"image": "https://yoursite.com/path-to-image.jpg",
"sku": "product-sku-12345"
}
</script>
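
To check locally that a page exposes a block like the one above, a short Python sketch using only the standard library can pull JSON-LD out of the raw HTML. The `find_product` helper is hypothetical and only approximates what a scraper might do; it ignores nested structures such as `@graph`.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def find_product(html: str):
    """Return the first JSON-LD object whose @type is Product, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Product":
                return item
    return None


page = """<script type="application/ld+json">
{"@type": "Product", "@context": "http://schema.org/",
 "name": "My Product Name",
 "brand": {"@type": "Thing", "name": "My Brand Name"},
 "image": "https://yoursite.com/path-to-image.jpg",
 "sku": "product-sku-12345"}
</script>"""

product = find_product(page)
print(product["name"], product["sku"])
```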

Microdata Beta

Example:

<div itemscope itemtype="http://schema.org/Product">
<span itemprop="name">My Product Name</span>
<span itemprop="brand">My Brand Name</span><br>
<img itemprop="image" src="https://yoursite.com/path-to-image.jpg"><br>
<span itemprop="description">Some optional description</span><br>
Product number: <span itemprop="sku" content="product-sku-12345"></span><br>
</div>

2. OpenGraph meta-tags (og:)

Specification: https://developers.facebook.com/docs/payments/product/

Example:


<meta property="og:type" content="og:product" />
<meta property="og:title" content="My Product Name" />
<meta property="product:brand" content="My Brand Name" />
<meta property="og:image" content="http://path-to-thumbnail" />
<meta property="og:description" content="Some optional description!" />
<meta property="product:retailer_item_id" content="product-sku-12345" />
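
As a rough illustration of how these tags map to product fields, the Python sketch below reads the meta properties used in the example above. The `OG_FIELDS` mapping and the `parse_open_graph` helper are assumptions made for this example, not a description of the scraper's internals.

```python
from html.parser import HTMLParser

# Meta properties from the example above, mapped to illustrative field names.
OG_FIELDS = {
    "og:title": "title",
    "og:image": "image",
    "product:brand": "brand",
    "product:retailer_item_id": "sku",
}


class OpenGraphParser(HTMLParser):
    """Collects the first value of each known OpenGraph property."""

    def __init__(self):
        super().__init__()
        self.product = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        field = OG_FIELDS.get(attributes.get("property"))
        content = attributes.get("content")
        if field and content and field not in self.product:
            self.product[field] = content


def parse_open_graph(html: str) -> dict:
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.product


head = """
<meta property="og:type" content="og:product" />
<meta property="og:title" content="My Product Name" />
<meta property="product:brand" content="My Brand Name" />
<meta property="og:image" content="http://path-to-thumbnail" />
<meta property="product:retailer_item_id" content="product-sku-12345" />
"""

og_product = parse_open_graph(head)
print(og_product)
```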

3. Generic HTML tags

If none of the structured product data described above is found, the product scraper falls back to generic information found on most websites, such as the title element and images.

<head>
<title> My Product Page Name </title>
</head>
<body>
...
<img src="https://yoursite.com/path-to-image.jpg">
...
</body>
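
A last-resort fallback of this kind might look like the Python sketch below. It assumes the first image on the page is used as the thumbnail; this document does not say which image the scraper picks, so treat that choice and the `parse_generic` helper as hypothetical.

```python
from html.parser import HTMLParser


class GenericTagParser(HTMLParser):
    """Reads the <title> text and the src of the first <img> that has one."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.image = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img" and self.image is None:
            # Assumption: the first image with a src attribute wins.
            self.image = dict(attrs).get("src")

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def parse_generic(html: str) -> dict:
    parser = GenericTagParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "image": parser.image}


page = """<head>
<title> My Product Page Name </title>
</head>
<body>
<img src="https://yoursite.com/path-to-image.jpg">
</body>"""

generic_product = parse_generic(page)
print(generic_product)
```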

note

The Bambuser Product Scraper server is located in the US, so your assets need to be accessible from US-based IP addresses. If they are not, you need to whitelist our product scraper as described below.

Whitelist the scraper

A typical case where you need to whitelist our scraper is when you want to add products from a staging or test environment that is not publicly accessible. You can make an exemption for our scraper's user-agent or whitelist its static IP address.

By User-agent:

The scraper identifies itself with the following user-agent: BambuserLiveShopping/1.0. You can make an exception for requests made by this user-agent.
Once the user-agent is whitelisted, the scraper should start working right away.

By Static IP address:

You can also whitelist our scraper by its static IP addresses:

  • Global server: 35.224.84.15
  • EU server: 35.240.106.166
Static IP Activation

In addition to whitelisting our static IP addresses on your side, you also need to ask Bambuser staff to enable 'Static IP proxy' for your organization.
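
On your side, either exemption comes down to a check like the Python sketch below, placed wherever your staging environment filters requests. The `is_bambuser_scraper` helper is hypothetical, and the substring match on the user-agent is an assumption in case the header carries extra tokens. Most CDNs and firewalls express the same rule in their own configuration syntax. The IP branch only helps once 'Static IP proxy' has been enabled for your organization.

```python
import ipaddress

SCRAPER_USER_AGENT = "BambuserLiveShopping/1.0"
SCRAPER_IPS = {
    ipaddress.ip_address("35.224.84.15"),    # Global server
    ipaddress.ip_address("35.240.106.166"),  # EU server
}


def is_bambuser_scraper(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request matches the documented user-agent or a static IP."""
    if SCRAPER_USER_AGENT in (user_agent or ""):
        return True
    try:
        return ipaddress.ip_address(remote_ip) in SCRAPER_IPS
    except ValueError:
        return False  # malformed address string


print(is_bambuser_scraper("BambuserLiveShopping/1.0", "203.0.113.7"))
print(is_bambuser_scraper("Mozilla/5.0", "35.240.106.166"))
print(is_bambuser_scraper("Mozilla/5.0", "203.0.113.7"))
```

Keep in mind that a user-agent header can be set by any client, so a user-agent exemption is a convenience rather than an access control.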

Whitelist Bambuser Image Transformer

If your products are scraped correctly, but the images/thumbnails are not shown properly, you may need to whitelist our image transformer.
This can be caused by restrictions on your CDN that block requests from our cloud-based image transformer.

This issue is also common if you are using Akamai CDN with strict rules.

Solution
Follow the same process as for whitelisting the scraper on your CDN rules.
For best performance, we recommend that you only whitelist Bambuser Image Transformer by user-agent (BambuserLiveShopping/1.0) as this option does not require an additional proxy stage.


Product data scraping FAQ

We highly recommend using the JSON-LD format of Schema.org/Product.

Absolutely! You can then update product details such as Title and Thumbnail manually. The product scraper is only a tool to automate manual data insertion and make your life easier.

  • Ensure your URL is also accessible from US regions, since our scraper is located in the United States. If you are outside the US, you can test this using a VPN.
  • You can check the forwarded error response in the network tab of your browser's developer tools.

For the scraper to initialize the product template fields with correct values, there must be valid Schema.org/Product structured data available and accessible on your product page.

  • Ensure the structured data is valid and has no critical errors, using Google's testing tool.
  • Ensure that the structured data is present in the response to the initial page request, and is not loaded and rendered afterwards.

How to check that?
Navigate to your product page (the same URL you are trying to add), right-click, and select "View Page Source" to open the response of the request to the product URL. Then look through the source code and double-check that the structured data exists, is valid, and has correct values.

There might be errors in your JSON-LD data that Google's structured data testing tool does not complain about.
Use https://json-ld.org/playground/ to check the validity of your JSON-LD data.
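
As a quick local complement, a strict JSON parse catches plain syntax problems such as a trailing comma. The Python sketch below checks JSON syntax only, not JSON-LD semantics, and the `json_ld_syntax_error` helper is hypothetical.

```python
import json


def json_ld_syntax_error(raw: str):
    """Return None if `raw` parses as strict JSON, else a short error description."""
    try:
        json.loads(raw)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


valid = '{"@type": "Product", "name": "My Product Name"}'
trailing_comma = '{"@type": "Product", "name": "My Product Name",}'

print(json_ld_syntax_error(valid))
print(json_ld_syntax_error(trailing_comma))
```

Paste the contents of your `application/ld+json` script element into `raw` to try it against your own page.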

If everything still looks fine on your side, contact the support department at support@bambuser.com