
  Building Your First Web Scraper, Part 2

    In this tutorial, you will learn how you can use Mechanize to click links, fill out forms, and upload files. You'll also learn how you can slice Mechanize page objects and how to automate a Google search and save its results.

    Topics

    • Single Page vs. Pagination
    • Mechanize
    • Agent
    • Page
    • Nokogiri Methods
    • Links
    • Click
    • Forms

    Single Page vs. Pagination

    So far we have spent some time figuring out how we can scrape the screen of a single page using Nokogiri. This was a good basis to move one step forward and learn how to extract content from multiple pages. 

    After all, the problem we're trying to solve involves getting the content from more than 140 episodes—which is more content than can reasonably fit on a single web page. We have to work with pagination and need to figure out how to follow content down the rabbit hole.

    This is where Nokogiri stops and another useful gem called Mechanize comes into play.

    Mechanize

    Mechanize is another powerful tool that has lots of goodies to offer. It essentially enables you to automate interactions with the websites you need to extract content from. In that sense, it reminds me a bit of functionality you might know from testing with Capybara.

    Don’t get me wrong, playing with Nokogiri on a single page is awesome in itself, but for spicier data extraction jobs, we need a bit more horsepower. We can essentially crawl through as many pages as we need and interact with their elements—imitating and automating human behavior. Pretty powerful stuff!

    This gem enables you to follow links, fill out form fields, and submit that data—even dealing with cookies is on the table. That means you can also imitate a user logging in to a private session and get content from a site that only you have access to.

    You fill out the login with your credentials and tell Mechanize how to follow along. Since you can click links and submit forms, there is very little that you cannot do with this tool. It has a close relationship to Nokogiri and also depends on it. Aaron Patterson is again one of the authors of this lovely gem.

    Instantiating a Mechanize Agent

    Before we can start mechanizing things, we need to instantiate a Mechanize agent.

    some_scraper.rb

    This agent will be used to fetch a page, similar to what we did with Nokogiri.

    some_scraper.rb

    What happens here is that the Mechanize agent got the podcast page and its cookies.

    Extracting Page Content

    We now have a page that is ready for extraction. Before we do so, I recommend that we take a look under the hood using the inspect method.

    some_scraper.rb

    The output is quite substantial. Take a look and see for yourself what a Mechanize::Page object consists of. Here you can see all the attributes for that page.

    To me, this is a really handy object to slice up the data you want to extract.

    Output

    If you want to take a look at the HTML page itself, you can tack on the body or content methods.

    some_scraper.rb

    Output

    Since this podcast has only a small number of different elements on the page, here, for comparison, is the Mechanize::Page that gets returned from github.com. It has a bigger variety of content to look at, which I think is important to get a feel for.

    Output github.com

    Back to the podcast, you can also look at things like encodings, the HTTP response code, the URI, or the response headers.

    some_scraper.rb

    Output

    There is lots more stuff if you want to dig deeper. I’ll leave it at that.

    Nokogiri Methods

    • at
    • search

    Mechanize uses Nokogiri to scrape data from pages. You can apply what you learned about Nokogiri in the first article and use it on Mechanize pages as well. That means that you generally use Mechanize to navigate pages and Nokogiri methods for your scraping needs. 

    For example, if you want to find a single object, you can use at, while search returns all objects that match a selector on a particular page. In other words, these methods work both on Nokogiri document objects and on Mechanize page objects.

    some_scraper.rb

    Output

    Links

    • links
    • link_with
    • links_with

    We can also navigate the whole site to our liking. Probably the most important part of Mechanize is its ability to let you play with links—otherwise you could pretty much stick with Nokogiri on its own. Let’s take a look at what gets returned when we ask a page for its links.

    some_scraper.rb

    Output

    Holy moly, let’s break this down. Since we haven’t told Mechanize to look elsewhere, we got an array of links from only that very first page. Mechanize goes through the page from top to bottom and returns the links in that order. I have created a little image with green pointers to the various links that you can see in the output.

    By the way, this is already showing you the end result of the redesign for my podcast. I think this version is a bit better for demonstration purposes. You also get a glimpse of how the final result looks and why I needed to scrape my old Sinatra site.

    Screenshot: Podcast Links

    As always, we can also extract just the text from that.

    some_scraper.rb

    Output

    Getting all these links in bulk can be very useful or simply tedious. Luckily for us, we have a few tools in place to fine tune what we need.

    some_scraper.rb

    Output

    Boom! Now we are getting somewhere! We can zoom in on specific links like that. We can target links that match certain criteria—like their text, for example—with a nicer API like links_with or link_with. Also, if we have multiple Focus links, we can zoom in on a particular one on the page using brackets [].

    some_scraper.rb

    If you are not after the link text but the link itself, you only need to specify a particular href to find that link. Mechanize won’t stand in your way. Instead of text, you feed the methods with href.

    some_scraper.rb

    If you only want to find the first link with the desired text, you can also make use of this syntax. Very convenient and a bit more readable.

    some_scraper.rb

    What about following that fella and seeing what hides behind this Focus link? Let’s click it!

    Click

    some_scraper.rb

    This would get us another long list of links like before. See how easy it was to combine .click.links. Mechanize clicks the link for you and follows the page to the new destination. Since we also requested a list of links, we will get all the links that Mechanize can find on that new page.

    Let’s say I have two text links of the same interviewee—one that links to tags and one to a recent episode—and I want to get the links from each of these pages. 

    some_scraper.rb

    This would give you a list of links for both pages. You iterate over each link for the interviewee, and Mechanize follows the clicked link and collects the links it finds on the new page for you. Below you can find a few examples where you can compare combinations to get you started.

    some_scraper.rb

    Forms

    • submit
    • field_with
    • checkbox_with
    • radiobuttons_with
    • file_uploads

    Let’s have a look at forms!

    some_scraper.rb

    Output

    Because we use the forms method, we get back an array—even when only one form is returned to us. Now that we know the form is named "f", we can use the singular version, form, to home in on that one.

    some_scraper.rb

    Using form('f'), we singled out the particular form we want to work with. As a result, we will not get an array returned.

    Output

    We can also identify the name of the text input field (q).

    We can target it by that name and set its value like a Ruby attribute—all we need to do is assign it a new value. You can see from the output example above that it is empty by default.

    some_scraper.rb

    Output

    As you can observe above, the value for the text field has changed to New Google Search. Now we only need to submit the form and collect the results from the page that Google returns. It couldn’t be any easier. Let’s search for something else this time!

    some_scraper.rb

    Here I identified the search result headers using the CSS selector h3.r, mapped their text, and pretty printed the results. That wasn’t too hard, was it? Sure, this is a simple example, but think about the endless possibilities you have at your disposal with this!

    Output

    Mechanize has different input fields available for you to play with. You can even upload files!

    • field_with
    • checkbox_with
    • radiobuttons_with
    • file_uploads

    You can also identify radio buttons and checkboxes by their name and check them with—you guessed it—check.

    some_scraper.rb

    Option tags let users select one item from a drop-down list. Again, we target them by name and select the option number we want.

    some_scraper.rb

    File uploads work similarly to inputting text into forms: you set the file like a Ruby attribute. You identify the upload field and then specify the file path (file name) of the file you want to transfer. It sounds more complicated than it is. Let’s have a look!

    some_scraper.rb

    Final Thoughts

    See, no magic after all! You are now well equipped to have some fun on your own. There is certainly a bit more to learn about Nokogiri and Mechanize, but instead of spending too much time on unnecessary aspects, play around with it and look into some more documentation when you run into problems beyond the scope of a beginner article.

    I hope you can see how beautifully simple this gem is and how much power it offers. As we all know from popular culture by now, that power also comes with responsibility. Use it within legal frameworks and when you have no access to an API. You probably won’t have a frequent use for these tools, but boy do they come in handy when you have some real scraping needs ahead of you.

    As promised, in the next article we will cover a real-world example where I will scrape data from my podcast site. I will extract it from an old Sinatra site and transfer it over to my new Middleman site that uses .markdown files for each episode. We will extract the dates, episode numbers, interviewee names, headers, subheaders, and so on. See you there!

     
