How To Use Excel As A Web Scraping Tool
Microsoft Excel is one of the most preferred tools to manage information in a structured form. It is a multipurpose software with great features and capabilities. Here is how Excel can be used as a basic web scraping tool to extract web data directly into a worksheet. We will take the help of Excel web queries for the same.
Web queries are a feature of Excel which is used to fetch data on a web page into the Excel worksheet easily. Excel can automatically find tables on the webpage and would let you pick the particular table you need data from. Web queries can also be helpful in cases where an ODBC connection is not possible to maintain other than just extracting data from the web pages. Let’s see how web queries work and how scraping of HTML tables off the web using them can be done.
Moving Ahead
Let’s start with a simple Web query to scrape data from the Yahoo! Finance page. This page is an easier one to scrape and hence it’s a good fit for learning the method purpose. The page is also pretty straightforward and doesn’t have much important information in the form of links or images. Here is the URL we will be using for the tutorial:
http://finance.yahoo.com/q/hp?s=GOOG
Creating a new Web query:
1. Select the cell in which you want the data to appear.
2. Click on the Data-> From Web
3. The New Web query box will pop up as shown below.
4. Enter the web page URL you want to extract data from in the Address bar and click the Go button.
5. Click on the yellow-black buttons next to the table you want to extract data from.
6. After selecting the required tables, click on the Import button and you’re good to go. Excel will now start downloading the content of the selected tables into your worksheet.
Once you have the data scraped into your Excel worksheet, you can do several things like creating charts, sorting, formatting, etc. to better understand or present the data in a meaningful way.
Customize the query
Once a web query is created, you can customize it according to your requirements. To do this, access Web query properties by right-clicking on a cell with the extracted data. The page you were querying appears again, click on the Options button to the right of the address bar. A new pop up box will show where you can customize how the web query interacts with the target page. You get options to change some of the basic things related to web pages like redirections and formatting.
Besides, you can also alter the date range options by right-clicking on a random cell with the query results and selecting Data range properties. The data range properties dialog box will pop up where you can make the necessary changes. You can rename the data range to something you can easily recognize like ‘Stock Value’.
Auto-refresh
Auto-refresh is a distinct feature of web queries that’s worth mentioning, and which makes our Excel web scraper a truly powerful one. You can also make your extracted data auto-refreshing so that your Excel worksheet will update the data whenever there is a change in the source website. You can also set how often you need the data to be updated from the source web page from the data range options menu. The auto-refresh feature can be enabled by ticking the box beside ‘Refresh every’ and setting your preferred time interval for updating the data.
Web scraping at large scale
Although extracting data using Excel can be a great way to scrape HTML tables from the web, it is nowhere close to a real web scraping solution. This can come in handy if you are collecting data for your college research paper or you are a hobbyist looking for a cheaper way to get some data. If you want data for business, you will have to depend on an expert web scraping provider with expertise in dealing with web scraping at scale. Outsourcing the complicated process will also help you save some precious time that you can invest in marketing your business.