In this article, Eagle Alpha’s Director of Data Strategy and Analytics, Ronan Crosson, shares his insights on web crawled data, based on content from Eagle Alpha’s Data Strategy solution.
Multiple surveys and our own experience at Eagle Alpha highlight web crawled data as one of the most popular categories of alternative data. There are multiple reasons for this, including breadth of applications, ease of access and low price points. In this article I will touch on what I consider to be the most important aspects of this alternative data category.
What is Web Crawled Data?
One of our recent monthly alpha workshops focused on the topic of web crawled data. Web crawling was defined as a means of aggregating data “via a computer program which requests information from public URLs. The data can be collected in-house or by companies that specialize in customized data collection”.
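To make the definition concrete, here is a minimal sketch of such a program in Python, using only the standard library. The function names, the user-agent string and the choice to extract the page title are illustrative assumptions, not a description of how any particular fund or vendor collects data.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Parse an HTML string and return its title text, stripped of whitespace."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def fetch(url: str, user_agent: str = "example-crawler/0.1") -> str:
    """Request a public URL and return the decoded response body.

    The user-agent value is a placeholder chosen for this sketch.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example usage (performs a network request, so it is left commented out):
# print(extract_title(fetch("https://example.com/")))
```

A production crawler would add scheduling, retries, rate limiting and storage on top of this request-and-parse loop, which is a large part of what specialist collection companies provide.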
For some time now, funds have been gathering data from websites such as auto retailers, e-commerce sites, real estate listings, employment sites and online travel agencies. Web crawled data also constitutes a large portion of the data for other alternative data categories such as social media data, employment data, store location data, pricing data and ratings/reviews data.
In a 2019 survey by law firm Lowenstein Sandler, 49% of funds responded that they used web crawled data and 57% said they planned to use the data in the next 12 months. It’s worth noting that 67% of respondents to the survey used social media data, which is frequently collected via web crawling.
Analysis of aggregated and anonymized user click data from the Eagle Alpha platform reveals that the market share of web crawling datasets is declining, but the employment data and pricing data categories, which also utilize web crawling techniques, have seen an increase in market share.
The Value of Web Crawled Data:
In the same alpha workshop, we explored the most common use cases for web crawled data.
Figure 1: Online Pricing and Discounting, Source: Eagle Alpha Data Partner
Figure 1 above is based on data that has been scraped from Lululemon’s website. The dataset tracks KPIs for the company’s online presence such as pricing, discounting and SKU count. These metrics are measured for We Made Too Much (inventory on sale), What’s New (new products), Bestsellers and aggregate level data for the Men’s and Women’s categories. The dataset has proven useful for tracking growth in the expanding men’s category and for measuring the influence of pricing and discounting on margins.
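As a rough illustration of how metrics like these could be derived once product listings have been scraped, the sketch below computes SKU count, average selling price, share of items discounted and average discount depth per category. The record layout, field names and sample rows are made up for this example and do not reflect the data partner’s actual methodology or real prices.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Listing:
    sku: str
    category: str      # e.g. "Men" or "Women"
    list_price: float  # full price shown on the page
    sale_price: float  # current selling price (equals list_price if not discounted)


def category_kpis(listings):
    """Return SKU count, average price, share discounted and average discount depth per category."""
    by_category = {}
    for item in listings:
        by_category.setdefault(item.category, []).append(item)

    kpis = {}
    for category, items in by_category.items():
        # Discount depth is measured as a fraction of the list price.
        discounts = [1 - i.sale_price / i.list_price for i in items if i.sale_price < i.list_price]
        kpis[category] = {
            "sku_count": len({i.sku for i in items}),
            "avg_sale_price": mean(i.sale_price for i in items),
            "share_discounted": len(discounts) / len(items),
            "avg_discount_depth": mean(discounts) if discounts else 0.0,
        }
    return kpis


# Made-up rows for illustration only; these are not real prices.
sample = [
    Listing("A1", "Men", 100.0, 100.0),
    Listing("A2", "Men", 80.0, 60.0),
    Listing("B1", "Women", 120.0, 90.0),
    Listing("B2", "Women", 50.0, 50.0),
]
kpis = category_kpis(sample)
```

Running the same calculation on each day’s crawl and storing the results would give a time series of the kind plotted in a chart such as Figure 1.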
Figure 2: Online Job Listings, Source: Eagle Alpha Data Partner
Another common use for web crawled data is tracking online job listings. Figure 2 shows an analysis of a job listings dataset that looked at the hiring of legal personnel at some of the largest technology companies. The analysis showed that Facebook was hiring legal-related staff at a much higher rate than its peers. This proved to be insightful, as Facebook mentioned on its December quarter conference call that “G&A grew 87% largely driven by higher legal fees and settlements”. Backing out a settlement charge of $550m, G&A still grew 31% YoY in the quarter.
Figure 3: Social Media Data, Source: Eagle Alpha Data Partner
A final popular use case for web crawled data is social media analysis. Figure 3 shows social media mentions and sentiment data for online streaming platforms. The plot on the left shows social media mentions excluding Netflix. The data revealed that Apple TV+ garnered a short-lived bounce in consumer interest when it launched in March of 2019. In contrast, Disney+ saw a much higher count of mentions over several months. This was an indication of consumer interest in Disney+. The social media data was indicative of subscriber growth for Apple TV+ and Disney+ when the companies updated investors on quarterly conference calls.
Challenges When Working with Web Crawled Data:
In our web crawling alpha workshop, we also discussed the challenges of working with web crawled data. The two greatest challenges highlighted were history and legal considerations.
When engaging in an internal web crawling project, we need to accept that historical data is typically not available. This is particularly challenging for e-commerce sites, where historical pricing and availability are important KPIs a user might track.
Some historical snapshots are available from organisations such as https://archive.org/ and https://commoncrawl.org/, but coverage is typically not sufficient for an investment application. Some sites do include historical data, most notably forums and ratings and reviews sites, where historical posts are available and clearly time-stamped.
An alternative to an internal web crawling project is to work with a third-party provider. Frequently these will have historical data, but typically only for very niche applications. One example is employment data, as highlighted earlier. Where web crawling providers do not have historical data, they will most likely be able to help on a go-forward basis.
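One way to judge whether archived history is dense enough for a given page is to look at the gaps between captures. The sketch below assumes you already have a list of capture dates for a URL (for example, obtained from an archive’s index) and reports every stretch longer than a chosen threshold. The function name, the seven-day threshold and the sample dates are hypothetical.

```python
from datetime import date


def coverage_gaps(snapshot_dates, start, end, max_gap_days=7):
    """List (gap_start, gap_end, days) stretches between start and end
    where consecutive snapshots are more than max_gap_days apart."""
    # Bracket the in-range snapshots with the window boundaries so that
    # missing history at either end of the window is also reported.
    in_range = sorted(d for d in set(snapshot_dates) if start <= d <= end)
    points = [start] + in_range + [end]
    gaps = []
    for previous, current in zip(points, points[1:]):
        days = (current - previous).days
        if days > max_gap_days:
            gaps.append((previous, current, days))
    return gaps


# Hypothetical capture dates for a single product page.
captures = [date(2019, 1, 1), date(2019, 1, 5), date(2019, 3, 1)]
gaps = coverage_gaps(captures, date(2019, 1, 1), date(2019, 3, 31))
```

For a pricing use case, where discounts can come and go within days, gaps of several weeks like the ones in this made-up sample would make the archived history hard to rely on.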
Next, we’ll address the second challenge of working with web crawled data – legal considerations.
Legal Considerations of Web Crawled Data:
In that workshop, Peter Greene from Lowenstein Sandler placed particular emphasis on the question of whether data on the web is public data. He concluded that it can be argued that web crawled data is in the public domain, as long as you don’t need to enter a password to view the information. As long as it’s considered public, the legal analysis takes you out of the insider trading realm, in Peter’s opinion.
Lowenstein Sandler draw the line on password-protected content. Data from a section of a website that is behind a password is not public data in their opinion.
It is also important to note that the website operators have a lot of tools they can use to block someone from scraping. Lowenstein Sandler don’t recommend clients make efforts to circumvent these obstacles.
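One widely used signal of a site operator’s crawling preferences is the robots.txt file. It is not named in the workshop discussion, so treat it here as an assumption about which tools are meant. The sketch below uses Python’s standard urllib.robotparser to check whether a URL is permitted for a given user agent before requesting it; the helper name, user-agent string and rules are illustrative.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules: everything is open except the account area.
example_rules = """\
User-agent: *
Disallow: /account/
"""

public_ok = is_allowed(example_rules, "example-crawler", "https://example.com/products/1")
account_ok = is_allowed(example_rules, "example-crawler", "https://example.com/account/orders")
```

A crawler built along these lines would skip any URL for which the check returns False rather than look for a way around it.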
One of the highest profile cases involving web crawling is between LinkedIn and a company called HiQ.
HiQ’s business is based on working with corporations with respect to job moves of employees. LinkedIn profiles had been the primary source of its data, and HiQ would search the entirety of LinkedIn’s database. However, HiQ received a cease and desist letter from LinkedIn in May 2017. HiQ complied and started to scrape only publicly available data.
LinkedIn then decided to prevent any kind of scraping – even of public information – and put technological barriers in place.
In June 2017, HiQ commenced an action for an injunction to allow it to continue to scrape public profiles. The United States District Court for the Northern District of California agreed with HiQ. LinkedIn appealed that decision to the United States Court of Appeals for the Ninth Circuit. On September 9, 2019, the Ninth Circuit rejected LinkedIn’s effort to stop HiQ from using information crawled from LinkedIn’s website.
Most observers have taken the rulings in the HiQ vs. LinkedIn case as evidence that web crawling is legal. We have written multiple articles on the case and we even dedicated an entire legal workshop to the topic.
It’s also worth noting that regulations regarding web scraping vary by region. For instance, in the past we published an article discussing guidance on web crawling from The National Commission on Informatics and Liberty (CNIL), a French regulatory body. The guidance indicates that even if individual contact details are collected from public posts, it doesn’t mean that individuals were expecting their data to be harvested for “prospecting”. Therefore, the CNIL treats these public posts as personal data which cannot be used without consent as specified under the GDPR.
Peter Greene highlighted what he suggests to his clients who engage in web crawling:
- Develop a one-page scraping permission sheet.
- Carefully negotiate the agreements with scrapers and crawlers, including the reps and the data provenance.
This process will be proof to the regulator that a firm took the necessary steps when engaging in web crawling.
Web crawled data has consistently ranked as one of the most popular categories of alternative data due to its broad applications, ease of use and relative inexpensiveness. Although datasets tagged as web crawling have been losing share of clicks on Eagle Alpha’s platform, other datasets that rely on web crawled data, such as employment data and pricing data, are gaining share. The lack of historical data can sometimes be overcome through public databases or niche datasets from specialist vendors. The major legal consideration is whether the data is public, and expert opinion suggests that web data is public as long as it’s not behind a password.
To learn more please contact: firstname.lastname@example.org
About the author:
Ronan Crosson, Director of Data Strategy & Analytics,
Ronan was a senior analyst at State Street Global Advisors. In his role as Director of Data Strategy & Analytics Ronan oversees the analyst team and advises some of the largest funds in the world on their data strategies.
You can contact Ronan at Ronan.email@example.com
You can contact Eagle Alpha at:
USA: +1 646 843 6048 UK: +44 (0) 20 7151 4880
Or Email: firstname.lastname@example.org