Scraping Options

GPT Researcher offers several methods for web scraping: static scraping with BeautifulSoup, dynamic scraping with Selenium or NoDriver, and high-scale scraping with Tavily Extract or FireCrawl. This document explains how to switch between these methods and the benefits of each approach.

Configuring Scraping Method

You can choose your preferred scraping method by setting the SCRAPER environment variable:

For BeautifulSoup (static scraping):
```
export SCRAPER="bs"
```
For dynamic browser scraping, either with Selenium:
```
export SCRAPER="browser"
```
Or with NoDriver (ZenDriver):
```
export SCRAPER="nodriver"
pip install zendriver
```
For production use cases, you can set SCRAPER to tavily_extract or firecrawl. Tavily lets you scrape sites at scale without the hassle of setting up proxies, managing cookies, or dealing with CAPTCHAs. Note that this option requires a Tavily account and API key. To learn more, see the Tavily Extract documentation. First install the pip package tavily-python, then:
```
export SCRAPER="tavily_extract"
```
FireCrawl also lets you scrape sites at scale. In addition, FireCrawl publishes open-source code for running a self-hosted server. It can give better scrape quality than BeautifulSoup because it passes a markdown version of each scraped site to the LLM. To use this option you need either a FireCrawl account (official service) to obtain an API key, or the URL of your self-hosted server plus its API key (if you configured one). Make sure to install the pip package firecrawl-py, then:
```
export SCRAPER="firecrawl"
```

Note: If not set, GPT Researcher will default to BeautifulSoup for scraping.
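The selection rule above can be sketched in a few lines of Python. This is an illustration only, not GPT Researcher's own code: the helper name `resolve_scraper` and the `KNOWN_SCRAPERS` set are hypothetical, and the set simply lists the values named in this document.

```python
import os

# Values named in this document; the set is illustrative, not exhaustive.
KNOWN_SCRAPERS = {"bs", "browser", "nodriver", "tavily_extract", "firecrawl"}
DEFAULT_SCRAPER = "bs"  # BeautifulSoup is the documented default


def resolve_scraper(env=None):
    """Return the scraper name from SCRAPER, falling back to the default."""
    env = os.environ if env is None else env
    value = env.get("SCRAPER", "").strip() or DEFAULT_SCRAPER
    if value not in KNOWN_SCRAPERS:
        raise ValueError(
            f"Unknown SCRAPER value {value!r}; expected one of {sorted(KNOWN_SCRAPERS)}"
        )
    return value


if __name__ == "__main__":
    print(resolve_scraper())
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching your shell configuration.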

Scraping Methods Explained

BeautifulSoup (Static Scraping)

When SCRAPER="bs", GPT Researcher uses BeautifulSoup for static scraping. This method:

Sends a single HTTP request to fetch the page content
Parses the static HTML content
Extracts text and data from the parsed HTML

Benefits:

Faster and more lightweight
Doesn't require additional setup
Works well for simple, static websites

Limitations:

Cannot handle dynamic content loaded by JavaScript
May miss content that requires user interaction to display
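To see why static parsing misses JavaScript-loaded content, consider the minimal sketch below. It uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages; the class and function names are hypothetical. The parser only sees the markup that was delivered, so text that a script would insert later never appears.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_static_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# The <div id="app"> is empty in the delivered HTML; a browser would fill it
# by running the script, but a static parser never executes that script.
PAGE = """
<html><body>
  <h1>Static headline</h1>
  <div id="app"></div>
  <script>document.getElementById("app").textContent = "Loaded later";</script>
</body></html>
"""

text = extract_static_text(PAGE)
print(text)
```

The extracted text contains the headline but not the string the script would have written into the page, which is the limitation described above.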

Selenium (Browser Scraping)

When SCRAPER="browser", GPT Researcher uses Selenium for dynamic scraping. This method:

Opens a real browser instance (Chrome by default)
Loads the page and executes JavaScript
Waits for dynamic content to load
Extracts text and data from the fully rendered page

Benefits:

Can scrape dynamically loaded content
Simulates real user interactions (scrolling, clicking, etc.)
Works well for complex, JavaScript-heavy websites

Limitations:

Slower than static scraping
Requires more system resources
Requires additional setup (Selenium and WebDriver installation)

NoDriver (Browser Scraping)

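When SCRAPER="nodriver", GPT Researcher uses NoDriver (installed through the zendriver package) as an alternative to Selenium, with potentially better performance.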

Setup:

pip install zendriver

Tavily Extract (Recommended for Production)

When SCRAPER="tavily_extract", GPT Researcher uses Tavily's Extract API for web scraping. This method:

Uses Tavily's robust infrastructure to handle web scraping at scale
Automatically handles CAPTCHAs, JavaScript rendering, and anti-bot measures
Provides clean, structured content extraction

Benefits:

Production-ready and highly reliable
No need to manage proxies or handle rate limiting
Excellent success rate on most websites
Handles both static and dynamic content
Built-in content cleaning and formatting
Fast response times through Tavily's distributed infrastructure

Setup:

Create a Tavily account at app.tavily.com
Get your API key from the dashboard
Install the Tavily Python SDK:
```
pip install tavily-python
```
Set your Tavily API key:
```
export TAVILY_API_KEY="your-api-key"
```

Usage Considerations:

Requires a Tavily API key and account
API calls are metered based on your Tavily plan
Best for production environments where reliability is crucial
Ideal for businesses and applications that need consistent scraping results
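Before switching to `tavily_extract`, it can help to confirm that both setup steps took effect. The following is a minimal pre-flight sketch, not part of GPT Researcher: the function name `tavily_preflight` is hypothetical, and it assumes the tavily-python package is importable under the module name `tavily`.

```python
import importlib.util
import os


def tavily_preflight(env=None):
    """Return a list of setup problems; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("TAVILY_API_KEY"):
        problems.append("TAVILY_API_KEY is not set")
    # Assumption: tavily-python installs a top-level module named "tavily".
    if importlib.util.find_spec("tavily") is None:
        problems.append("tavily-python is not installed (pip install tavily-python)")
    return problems


if __name__ == "__main__":
    for problem in tavily_preflight():
        print("Setup issue:", problem)
```

The check only verifies that a key is present, not that it is valid; an invalid key would still surface as an API error at scrape time.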

FireCrawl (Recommended for Production)

When SCRAPER="firecrawl", GPT Researcher uses the FireCrawl Scrape API, which returns scraped pages in markdown format. This method:

Uses FireCrawl's robust infrastructure to handle web scraping at scale
Alternatively, uses a self-hosted FireCrawl server
Automatically handles CAPTCHAs, JavaScript rendering, and anti-bot measures
Provides clean, structured content extraction in markdown format

Benefits:

Production-ready and highly reliable
No need to manage proxies or handle rate limiting
Excellent success rate on most websites
Handles both static and dynamic content
Built-in content cleaning and formatting
Fast response times through FireCrawl's distributed infrastructure
Easy to set up with a self-hosted FireCrawl server

Setup (official service by FireCrawl):

Create a FireCrawl account at firecrawl.dev/app
Get your API key from the dashboard
Install the FireCrawl Python SDK:
```
pip install firecrawl-py
```

Set your FireCrawl API key:

export FIRECRAWL_API_KEY="your-firecrawl-api-key"

Setup (with self-hosted server):

Host your own FireCrawl server by following FireCrawl's self-hosting guide or its run-locally guide.
Get your server URL and API key (if you set one).
Install the FireCrawl Python SDK:
```
pip install firecrawl-py
```

Set your FireCrawl API key:

export FIRECRAWL_API_KEY="your-firecrawl-api-key"

Note: FIRECRAWL_API_KEY can be empty (FIRECRAWL_API_KEY="") if you have not set up authentication on your self-hosted server. The cloud service and the open-source version differ in some respects; see FireCrawl's documentation for a comparison of the two options.

Usage Considerations:

Requires a FireCrawl API key and account or self-hosted server
API calls are metered based on your FireCrawl plan (a self-hosted FireCrawl server can be essentially free)
Best for production environments where reliability is crucial (for their cloud service)
Ideal for businesses and applications that need consistent scraping results
Also a good fit when you need a robust scraping option for personal use

Additional Setup for Selenium

If you choose to use Selenium (SCRAPER="browser"), you'll need to:

Install the Selenium package:
```
pip install selenium
```
Download the appropriate WebDriver for your browser:
- For Chrome: ChromeDriver
- For Firefox: GeckoDriver
- For Safari: Built-in, no download required
Ensure the WebDriver is in your system's PATH.
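The checklist above can be verified from Python before running a research task. This is a rough sketch using only the standard library; the function name `check_selenium_setup` is hypothetical, and the executable names `chromedriver` and `geckodriver` are the usual names for those drivers, so adjust them if your installation differs.

```python
import importlib.util
import shutil


def check_selenium_setup(driver_names=("chromedriver", "geckodriver")):
    """Report whether Selenium is importable and where each driver is on PATH."""
    report = {
        "selenium_installed": importlib.util.find_spec("selenium") is not None
    }
    for name in driver_names:
        # shutil.which returns the full path, or None if not found on PATH.
        report[name] = shutil.which(name)
    return report


if __name__ == "__main__":
    for key, value in check_selenium_setup().items():
        print(f"{key}: {value}")
```

A value of `None` next to a driver name means that executable was not found on your PATH.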

Choosing the Right Method

Use BeautifulSoup (static) for:
- Simple websites with mostly static content
- Scenarios where speed is a priority
- When you don't need to interact with the page
Use Selenium (dynamic) for:
- Websites with content loaded via JavaScript
- Sites that require scrolling or clicking to load more content
- When you need to simulate user interactions

Troubleshooting

If Selenium fails to start, ensure you have the correct WebDriver installed and it's in your system's PATH.
If you encounter an ImportError related to Selenium, make sure you've installed the Selenium package.
If the scraper misses expected content, try switching between static and dynamic scraping to see which works better for your target website.

Remember, the choice between static and dynamic scraping can significantly impact the quality and completeness of the data GPT Researcher can gather. Choose the method that best suits your research needs and the websites you're targeting.
