How to make a parser (scraper/grabber). Part 1
Creation date: 2012-10-15 09:56:47
Last edited on: 2012-10-15 10:06:56
In this tutorial we will start making a parser for web pages. We will parse data from this site (shatalov.su), but after reading all the parser tutorials you'll be able to apply the same approach to other sites.
You'll need Visual C# 2012 (or an earlier version, but then you'll need to convert my samples yourself). We'll work in C# with HtmlAgilityPack. I'll show you how to install HtmlAgilityPack below.
In this tutorial we'll also load the page we want to parse into our program. Let's go!
HtmlAgilityPack setup
To use the latest version of HtmlAgilityPack in your application, you need the NuGet Package Manager extension. Let's check whether it is installed in your Visual Studio: choose the menu item Tools → Extensions and Updates:
In the left menu choose Installed → All. In the central part of the window you'll see all the extensions installed in your Visual Studio.
If NuGet Package Manager is listed, close this window and skip to How to install Html Agility Pack. If you don't have this extension, choose Online → Visual Studio Gallery in the left menu:
The NuGet extension will be first in the list. Click Download and install it; it's easy. After that, close the Extensions and Updates window.
How to install Html Agility Pack
Open menu item Tools → Library Package Manager → Package Manager Console:
The Package Manager Console will open at the bottom of the window. NuGet installs packages into the currently open project, so make sure a project is open first. Type Install-Package HtmlAgilityPack and wait a moment:
You'll see this notification in the console: Successfully installed 'HtmlAgilityPack 1.4.6'. Restart Visual Studio, and you can use the HtmlAgilityPack parser in your application.
Starting with HtmlAgilityPack
Let's create a console application. We don't need any of the standard namespaces for now, so I just remove all the using directives.
To use HtmlAgilityPack objects, you need to import the HtmlAgilityPack namespace:
!1? using HtmlAgilityPack;?1!
As I said at the beginning of the tutorial, we'll parse shatalov.su. Now we'll download one page from the website and store it in a local HTML file:
!1? HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://shatalov.su/en/game_programming.php");
doc.Save("localfile.html");?1!
The first line creates an HtmlWeb object. HtmlWeb is used to download the web page we need, and it has a Load method to do so. In the parentheses you pass the URL you want to download. web.Load returns an HtmlDocument. That's the main class we'll work with in the next tutorials. This class has a Save method, which takes one argument: the local file name.
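For readers who want to see where these statements go, here is a minimal sketch of the whole console program. The class name Program and the parameterless Main are my assumptions (they match what a typical console project uses); the three statements are the same ones shown above.

```csharp
using HtmlAgilityPack;

// Assumed class name; use whatever your console project generated.
class Program
{
    static void Main()
    {
        // HtmlWeb is the HtmlAgilityPack helper that downloads pages.
        HtmlWeb web = new HtmlWeb();

        // Load fetches the page at the given URL and parses it into an HtmlDocument.
        HtmlDocument doc = web.Load("http://shatalov.su/en/game_programming.php");

        // Save writes the document to a file. A relative name like this one
        // is resolved against the current working directory.
        doc.Save("localfile.html");
    }
}
```

The program needs a working internet connection when it runs, because web.Load downloads the page at that moment.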
Then build the program. Go to your project folder → bin → Debug (or Release) and run the application. Mine is called parser.exe. When I run it, it closes immediately but creates localfile.html, which contains the downloaded web page.
In the next tutorials we'll work with the HtmlDocument class and learn how to parse HTML content.
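Because the window closes right away, it can be hard to tell whether anything happened. One possible way to get some feedback is sketched below. It is not part of the original program: it assumes you add back the System and System.IO namespaces, and the messages it prints are only an illustration.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

// Assumed class name; use whatever your console project generated.
class Program
{
    static void Main()
    {
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://shatalov.su/en/game_programming.php");
        doc.Save("localfile.html");

        // Show the absolute path the relative file name resolves to,
        // so it is clear where the file was written.
        Console.WriteLine("Saved to " + Path.GetFullPath("localfile.html"));

        // Keep the console window open until Enter is pressed.
        Console.WriteLine("Press Enter to exit...");
        Console.ReadLine();
    }
}
```

If you start parser.exe by double-clicking it in the Debug folder, the path printed should point into that same folder.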
That's it.