How to write a web scraper (parser/grabber), and the wrong ways to do it
Creation date: 2012-10-10 12:20:37
Last edited on: 2012-10-10 12:25:43
Today we start studying web scrapers. Such scripts are also called parsers or grabbers.
What does a typical web scraper do? It extracts content from a web page, processes it, and... that's it. It just gathers information and stores it in a specific form.
In this tutorial we'll learn how not to write such scripts (parsers, grabbers, scrapers) and what the right way is. In the next tutorials we'll discuss building scrapers with different libraries and languages. For now:
The wrong way to code a web scraper
Beginners in web data extraction usually don't know about the specific tools that help with extracting data from web pages, so they choose the wrong ones. Usually these are string replacement, regular expressions, and using the DOM for parsing.
Web grabber with string replacement and regular expressions
The most brute-force ways to extract or process data are string replacement and regular expressions. Never use them in your parsers!
When a newbie gets the HTML content of the web page he wants to scrape, he searches it for specific data with regular expressions. Why is that wrong? Because HTML is too complex and too irregular to parse with regular expressions: the same element can be written with different attribute order, quoting, whitespace, or nesting, and a pattern tuned to one variant silently misses the others.
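To make the problem concrete, here is a minimal Python sketch. The pattern, the markup, and the function name are made up for illustration. It shows a regular expression that works on one version of a page and quietly returns nothing after a harmless markup change.

```python
import re

# Hypothetical pattern a beginner might write to grab a product price.
PRICE_RE = re.compile(r'<span class="price">(.*?)</span>')

original = '<span class="price">19.99</span>'
# The same element after a harmless redesign: an extra attribute,
# single quotes instead of double quotes, and a nested tag.
redesigned = "<span id='p1' class='price'><b>19.99</b></span>"


def extract_price(html):
    """Return the captured price text, or None when the pattern does not match."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None


print(extract_price(original))
print(extract_price(redesigned))
```

The second call finds nothing, even though a browser would show the same price for both snippets. Patching the pattern for every such variation quickly becomes unmanageable.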
Web grabbing and DOM model
The HTML DOM (Document Object Model) is used to manipulate the HTML content of web pages. At first glance it makes extraction easy, but that's not so. For example, it's hard to extract the content of elements with a specific class.
Although many scrapers allow using the DOM to extract data, don't use it, because there is a handier tool. And here it is:
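As a rough illustration of that last point, here is a sketch using Python's standard xml.dom.minidom, which has no built-in lookup by class name. The sample markup and both helper functions are hypothetical. Other DOM implementations may offer more convenience, so treat this as one example rather than a verdict on every DOM API.

```python
from xml.dom import minidom

# Made-up, well-formed sample markup.
DOC = ('<div><p class="note">first</p><p>skip</p>'
       '<span class="note big">second</span></div>')


def text_of(node):
    """Concatenate all text nodes below a DOM node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts)


def elements_with_class(node, class_name):
    """Walk the tree by hand and collect elements carrying class_name."""
    found = []
    for child in node.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue
        # getAttribute returns "" when the attribute is missing.
        if class_name in child.getAttribute("class").split():
            found.append(child)
        found.extend(elements_with_class(child, class_name))
    return found


dom = minidom.parseString(DOC)
notes = [text_of(el) for el in elements_with_class(dom, "note")]
print(notes)
```

Two recursive helpers are needed just to answer "give me the text of everything with class note".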
XML based web scrapers/grabbers/parsers
The only right way to parse HTML web pages is to build an XML representation of the HTML document. Why? Because HTML can easily be represented as XML.
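A minimal sketch of the idea, using only Python's standard library: an html.parser.HTMLParser subclass feeds tag events into xml.etree.ElementTree.TreeBuilder. The names TreeBuildingParser and html_to_tree are hypothetical, and the recovery rules for sloppy markup are deliberately simplistic. Real HTML-to-XML libraries handle many more edge cases.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuildingParser(HTMLParser):
    """Hypothetical helper: turns HTML parse events into an XML element tree."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()
        self._open = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. "disabled") arrive as None.
        self._builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)  # close void elements immediately
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray closing tag: ignore it
        # Close any elements left open inside this one, then the element itself.
        while self._open:
            current = self._open.pop()
            self._builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self._open:  # ignore text outside the root element
            self._builder.data(data)

    def finish(self):
        """Close whatever is still open and return the root element."""
        while self._open:
            self._builder.end(self._open.pop())
        return self._builder.close()


def html_to_tree(html):
    parser = TreeBuildingParser()
    parser.feed(html)
    parser.close()
    return parser.finish()


tree = html_to_tree('<html><body><p class="intro">Hello<br>world</p>'
                    '<img src="a.png"></body></html>')
print(ET.tostring(tree, encoding="unicode"))
```

Once the page is an element tree, the unclosed br and img tags of the original HTML no longer matter: every element has a definite parent, children, and attributes.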
And one more thing:
XPath
Many HTML parsers use XPath. XPath is a special language for finding information in XML documents. XPath is to XML almost what the HTML DOM is to HTML, but XPath is much easier to use.
So, in the next tutorials we'll use only HTML parsers that represent the HTML document as XML. Some of them will use XPath.
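For a taste of what that looks like, here is a small sketch with Python's standard xml.etree.ElementTree, which supports only a limited subset of XPath. The page content and names are made up, and the markup is assumed to be well-formed already.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page fragment.
PAGE = """<html><body>
<div class="product"><h2>Kettle</h2><span class="price">19.99</span></div>
<div class="product"><h2>Toaster</h2><span class="price">24.50</span></div>
<div class="ad"><span class="price">0.00</span></div>
</body></html>"""

root = ET.fromstring(PAGE)

products = {}
# ".//div[@class='product']" selects every div, at any depth,
# whose class attribute equals "product".
for div in root.findall(".//div[@class='product']"):
    name = div.findtext("h2")
    price = float(div.findtext("span[@class='price']"))
    products[name] = price

print(products)
```

One short path expression replaces the hand-written tree walking, and the price inside the "ad" block is skipped because its parent div does not match.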
That's it.