An AI-powered web scraper that extracts website content using Selenium and processes it with AI models like Gemini to generate insights based on user prompts. The application features an interactive Streamlit frontend for real-time content analysis.
- Extracts website content using Selenium
- Cleans and structures scraped data with BeautifulSoup
- Uses Gemini API for AI-driven insights and analysis
- Accepts user-provided URL and prompt for customized scraping
- Displays results in a Streamlit-based interactive UI
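The extraction and cleaning steps listed above might look roughly like the following. This is a minimal sketch, not the project's actual code: the function names, the fixed `wait_seconds` delay, and the choice of headless Chrome are all assumptions made for illustration.

```python
import time


def fetch_page_source(url: str, wait_seconds: float = 2.0) -> str:
    """Load `url` in headless Chrome and return the rendered HTML."""
    # Imported here so the text helpers below work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # assumption: a recent Chrome build
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude pause for client-side rendering
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


def collapse_blank_lines(text: str) -> str:
    """Strip every line and drop the ones that end up empty."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def extract_visible_text(html: str) -> str:
    """Remove script/style tags and return the remaining page text."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return collapse_blank_lines(soup.get_text(separator="\n"))
```

Calling `extract_visible_text(fetch_page_source(url))` would then yield plain text that can be passed to the AI step. A fixed sleep is the simplest way to wait for dynamic content; an explicit wait on a known element would be more reliable where the page structure is known.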
Tech stack:
- Backend: Python, Selenium, BeautifulSoup
- AI Processing: Gemini API, LangChain, OpenAI
- Frontend: Streamlit, HTML, CSS, JavaScript
- Clone the Repository

      git clone https://github.com/yourusername/ai-web-scraper.git
      cd ai-web-scraper

- Create and Activate Virtual Environment

      python3 -m venv venv
      source venv/bin/activate  # On Windows use `venv\Scripts\activate`

- Install Dependencies

      pip install -r requirements.txt

- Set Up API Key
  - Create a `.env` file in the project directory and add your Gemini API key:

        GEMINI_API_KEY=your_api_key_here

- Run the Application

      streamlit run app.py
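For reference, the application could read the key from `.env` along these lines. This is a sketch under assumptions: it supposes that `python-dotenv` is among the installed dependencies, and `get_gemini_api_key` is a hypothetical helper name, not something confirmed by the repository.

```python
import os


def get_gemini_api_key() -> str:
    """Return GEMINI_API_KEY from the environment, loading .env if possible."""
    try:
        # Assumption: python-dotenv is listed in requirements.txt.
        from dotenv import load_dotenv

        load_dotenv()  # reads a .env file if one is found
    except ImportError:
        pass  # fall back to variables already exported in the shell

    key = os.environ.get("GEMINI_API_KEY", "").strip()
    if not key or key == "your_api_key_here":
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env or export it.")
    return key
```

Failing early with a clear message avoids a less obvious authentication error later, when the first request is sent to the API.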
Usage:
- Open the Streamlit web UI.
- Enter a website URL and a prompt describing the data you want to extract.
- Click the "Scrape" button to retrieve and process the data.
- View the extracted content and AI-generated insights.
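The flow described in these steps could be wired up in Streamlit roughly as follows. This is an illustrative sketch only: `run_app`, `is_valid_url`, and the `scrape` and `analyze` callables are hypothetical names standing in for whatever `app.py` actually defines, and the widget labels are made up.

```python
from typing import Callable
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def run_app(
    scrape: Callable[[str], str],
    analyze: Callable[[str, str], str],
) -> None:
    """Render the form; `scrape` and `analyze` are hypothetical callables."""
    import streamlit as st  # imported here so is_valid_url works without it

    st.title("AI Web Scraper")
    url = st.text_input("Website URL")
    prompt = st.text_area("What should be extracted?")

    if st.button("Scrape"):
        if not is_valid_url(url):
            st.error("Enter a full URL starting with http:// or https://")
        elif not prompt.strip():
            st.error("Enter a prompt describing the data you want.")
        else:
            with st.spinner("Scraping and analyzing..."):
                content = scrape(url)
                insights = analyze(content, prompt)
            st.subheader("Extracted content")
            st.text(content[:2000])  # preview only; the full text may be long
            st.subheader("AI-generated insights")
            st.write(insights)
```

Calling `run_app(...)` at the end of `app.py` and launching with `streamlit run app.py` would render the form. Validating the URL before starting a browser session keeps obviously malformed input from triggering a slow scrape.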
Example prompt: Extract all product names and prices from this e-commerce website.
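A prompt like this one has to be combined with the scraped text before it reaches the model. The sketch below shows one way that might be done. It assumes the `google-generativeai` package (the project may route the call through LangChain instead); the model name `gemini-1.5-flash`, the `max_chars` limit, and both function names are placeholders chosen for illustration.

```python
def build_prompt(user_prompt: str, page_text: str, max_chars: int = 20000) -> str:
    """Combine the user's instruction with (possibly truncated) page text."""
    excerpt = page_text[:max_chars]  # assumed limit to keep requests small
    return (
        "You are extracting information from scraped website text.\n"
        f"Task: {user_prompt.strip()}\n"
        "Answer using only the content below.\n\n"
        f"Content:\n{excerpt}"
    )


def ask_gemini(prompt: str, api_key: str, model_name: str = "gemini-1.5-flash") -> str:
    """Send `prompt` to Gemini and return the text of the reply."""
    # Assumption: the google-generativeai client; the model name is a placeholder.
    import google.generativeai as genai

    genai.configure(api_key=api_key)
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(prompt)
    return response.text
```

Truncating the page text is the simplest way to stay within the model's input limit; splitting long pages into chunks and merging the answers would lose less content at the cost of more requests.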
Planned enhancements:
- Support for multiple AI models (e.g., OpenAI, Groq, Ollama)
- Improved data visualization in Streamlit
- Integration with databases for storing scraped content
This project is licensed under the MIT License.
Pull requests are welcome! Feel free to submit issues or feature requests.
For any inquiries, reach out via [your email] or open an issue on GitHub.