Goose. The beginning of the story…


Hi, I’m Andrew. I want to tell you about my first open source experience and the project that came out of it: Goose Parser.

Going back to the end of 2015, when the story begins, I was working as a Senior PHP Developer at a travel company; let’s call it “D”. At that moment I had no experience with Node.js. It was picked for the project because of its many useful tools, like PhantomJS, Puppeteer and others, with good APIs and the functionality to do anything on a web page.

Name it

So, why does it have a weird name like Goose?! Good question. The original name was Fantastic Unified Crawler Kit; you can imagine the acronym. That was funny enough, but not something you can promote and sell as a product: too dark… Around that time, I had a chance to work at D’s London office for a few weeks. I really loved the people, the TravelTech community, the city, the pubs, the parks, and especially the many different animals that roamed freely in the city’s green zones. Squirrels, geese, swans, deer… Finally, I went to the Hamleys toy shop and found a goose puppet. It had its own character and point of view. So the final decision was made, and the project was renamed. After a while we shot a short video trailer about the Goose. Here it is:

To be honest, there are plenty of scraping frameworks out there, and Goose is one of them. However, we were planning to create something that you can run by yourself and also scale into a platform, where you can execute a scraping script in the cloud, share it with anybody, and even sell it on a marketplace.

Features

So, let’s have a deeper look into the features Goose can provide:

  1. It can run in different environments with the same result. That simply means you can use the same scraper in the browser, Chromium, PhantomJS, jsdom, etc.
  2. It abstracts away the real environment and provides a set of actions and transformations that can be executed as part of scraping.
  3. The rules for the scraper are defined in JSON format, so you can easily save them in a database, as well as share them with other users, who can adapt them to scrape the same content.
  4. The result of scraping is also presented as JSON data.
  5. Out of the box, Goose ships with multiple environments in separate Docker containers, so you can simply run it as a CLI command with nothing but Docker on your machine.
  6. We use the debug library to simplify debugging in case something goes wrong.
  7. Goose provides plenty of extra components that save you significant time on common scraping tasks, such as pagination, captcha/blocking detection, proxy rotation and many more.
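
To make the point about JSON rules a bit more concrete, here is a minimal sketch of saving a rule set and loading it back. It uses only plain JavaScript; the `ruleStore` map, `saveRules` and `loadRules` are hypothetical names standing in for a real database layer and are not part of Goose. The rule object itself mirrors the search example later in this article.

```javascript
// Rule set in the same shape as the search example later in this article.
const rules = {
  scope: '.srg>.g',
  collection: [[
    { name: 'url', scope: 'h3.r>a', attr: 'href' },
    { name: 'text', scope: 'h3.r>a' },
  ]],
};

// Stand-in for a database table keyed by rule-set name (an assumption,
// not something Goose provides).
const ruleStore = new Map();

// Serialize the rules to a JSON string, as you would before writing them
// to a text or JSON column.
function saveRules(name, ruleSet) {
  ruleStore.set(name, JSON.stringify(ruleSet));
}

// Read the stored string back and turn it into a plain object again.
function loadRules(name) {
  const raw = ruleStore.get(name);
  if (raw === undefined) {
    throw new Error(`No rules stored under "${name}"`);
  }
  return JSON.parse(raw);
}

saveRules('google-search', rules);
const restored = loadRules('google-search');
```

Because the rules contain no functions or other non-JSON values, `restored` is an independent copy with the same content, so it should be usable wherever the original `rules` object was, for instance after another user has edited the stored version.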

Let it code

Here is a simple example of how Goose can scrape a web page for you.

const Parser = require('goose-parser');
const ChromeEnvironment = require('goose-chrome-environment');

// Environment that loads the target page in Chrome.
const env = new ChromeEnvironment({
  url: 'https://www.google.com/search?q=goose-parser',
});

const parser = new Parser({ environment: env });

(async function () {
  try {
    const results = await parser.parse({
      // Actions run before extraction: wait for the result blocks to appear.
      actions: [
        {
          type: 'wait',
          timeout: 10 * 1000, // milliseconds
          scope: '.srg>.g',
          parentScope: 'body'
        }
      ],
      // Rules describe what to extract from each result block.
      rules: {
        scope: '.srg>.g',
        collection: [[
          {
            name: 'url',
            scope: 'h3.r>a',
            attr: 'href',
          },
          {
            name: 'text',
            scope: 'h3.r>a',
          }
        ]]
      }
    });
    console.log(results);
  } catch (e) {
    console.log('Error occurred:');
    console.log(e.stack);
  }
})();

And here is a CLI usage example (Docker based):

docker run -it --rm -e "DEBUG=*,-puppeteer:*" \
    redcode/goose-parser:chrome-1.1.3-parser-0.6.0 \
    'https://www.google.com/search?q=goose-parser' \
    '{
      "actions": [
        {
          "type": "wait",
          "scope": ".g"
        }
      ],
      "rules": {
        "scope": ".g",
        "collection": [
          [
            {
              "scope": ".r>a h3",
              "name": "name"
            },
            {
              "scope": ".r>a:eq(0)",
              "name": "link",
              "attr": "href"
            }
          ]
        ]
      }
    }'
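
If you would rather start the container from a Node.js script than type the command by hand, the sketch below shows one way to do it. The image tag and the `DEBUG` filter are copied from the CLI example above; `buildDockerArgs` and `runGoose` are hypothetical helpers, not part of Goose. It also assumes that the container prints its result to stdout, which the article does not state. The `-it` flags are left out because a spawned process has no TTY attached.

```javascript
const { spawn } = require('child_process');

// Image tag taken from the CLI example above.
const IMAGE = 'redcode/goose-parser:chrome-1.1.3-parser-0.6.0';

// Build the argument list for `docker run`. Arguments are passed as an
// array, so the URL and the JSON need no shell quoting.
function buildDockerArgs(url, config) {
  return [
    'run', '--rm',
    '-e', 'DEBUG=*,-puppeteer:*',
    IMAGE,
    url,
    JSON.stringify(config),
  ];
}

// Hypothetical helper: runs the container and resolves with whatever it
// wrote to stdout (assumed here to be the scraping result).
function runGoose(url, config) {
  return new Promise((resolve, reject) => {
    const child = spawn('docker', buildDockerArgs(url, config), {
      // Ignore stdin, capture stdout, send debug output to the terminal.
      stdio: ['ignore', 'pipe', 'inherit'],
    });
    let stdout = '';
    child.stdout.on('data', (chunk) => { stdout += chunk; });
    child.on('error', reject);
    child.on('close', (code) => {
      if (code === 0) resolve(stdout);
      else reject(new Error(`docker exited with code ${code}`));
    });
  });
}

// Same actions and rules as in the CLI example above.
const config = {
  actions: [{ type: 'wait', scope: '.g' }],
  rules: {
    scope: '.g',
    collection: [[
      { scope: '.r>a h3', name: 'name' },
      { scope: '.r>a:eq(0)', name: 'link', attr: 'href' },
    ]],
  },
};

const args = buildDockerArgs('https://www.google.com/search?q=goose-parser', config);
```

With Docker installed, calling `runGoose('https://www.google.com/search?q=goose-parser', config).then(console.log)` would start the container. Building the arguments in a separate function keeps the part that is easy to get wrong (the JSON argument) checkable without running Docker at all.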

Final thoughts

Goose is a fancy web scraping framework that started as an open source tool in a single repo. Later, parts of Goose were moved to separate repositories to minimize the size of the original library and to allow scrapers to run in multiple environments. It has many useful features and can run as a Docker container, which significantly simplifies developing new scrapers.

Interested in giving it a try?

If so, then head over to goose.show to find more details! It’s free and open source! If you have any questions or feedback, you can either:

  1. Comment on this article.
  2. Drop me a message on Twitter or LinkedIn.
  3. Subscribe to the Telegram channel Parser.

Thanks for your time!

Published Apr 30, 2019

Passionate software engineer with expertise in software development, microservice architecture, and cloud infrastructure.