Hi, I’m Andrew. I want to tell you about my first open source experience and the project that grew out of it: Goose Parser.
The story begins at the end of 2015, when I was working as a Senior PHP Developer at a travel company, let’s call it “D”, and had no experience with Node.js yet. Node.js was picked for the project because of its ecosystem of useful tools, such as PhantomJS and later Puppeteer, which offer good APIs and enough functionality to do pretty much anything on a web page.
So, why such a weird name as Goose?! Good question. The original name was Fantastic Unified Crawler Kit; you can imagine the acronym. That was funny enough, but not something you can promote and sell as a product, a bit too edgy… Around that time, I had a chance to work at D’s London office for a few weeks. I really loved the people, the TravelTech community, the city, the pubs, the parks, and especially the many different animals that freely came and went in the city’s green zones. Squirrels, geese, swans, deer… Eventually, I went to the Hamleys toy shop and found a goose puppet. It had its own character and point of view. So the final decision was made, and the project got its new name. After a while we shot a short video trailer about the Goose. Here it is:
To be honest, there are plenty of existing scraping frameworks out there, and Goose is, in a sense, one of them. However, we were planning to build something you could run yourself, and then scale it into a platform where you could execute scraping scripts in the cloud, share them with anybody, and even sell them in a marketplace.
So, let’s take a deeper look at what Goose provides. Here is a simple example of how Goose can extract data from a web page for you:
const Parser = require('goose-parser');
const ChromeEnvironment = require('goose-chrome-environment');

// The environment drives a headless browser and opens the target page.
const env = new ChromeEnvironment({
    url: 'https://www.google.com/search?q=goose-parser',
});

const parser = new Parser({ environment: env });

(async function () {
    try {
        const results = await parser.parse({
            // Actions run before extraction: here we wait up to 10 seconds
            // for the search result nodes to appear on the page.
            actions: [
                {
                    type: 'wait',
                    timeout: 10 * 1000,
                    scope: '.srg>.g',
                    parentScope: 'body'
                }
            ],
            // Rules describe what to extract from each matched scope node.
            rules: {
                scope: '.srg>.g',
                collection: [[
                    {
                        name: 'url',
                        scope: 'h3.r>a',
                        attr: 'href',
                    },
                    {
                        name: 'text',
                        scope: 'h3.r>a',
                    }
                ]]
            }
        });
        console.log(results);
    } catch (e) {
        console.log('Error occurred:');
        console.log(e.stack);
    }
})();
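Assuming Google’s markup still matches those selectors, parse resolves with one object per matched .srg>.g node, using the field names declared in the rules. Roughly (a sketch with made-up values, not real captured output):

// Illustrative shape of `results`; the values here are invented:
[
    {
        url: 'https://github.com/redco/goose-parser',
        text: 'redco/goose-parser: ...'
    },
    // ...one entry per search result matched by '.srg>.g'
]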
And here is a CLI usage example (Docker based). The image takes the target URL and the parse definition (as JSON) as its arguments, and the DEBUG variable turns on verbose logging while muting Puppeteer’s own namespaces:
docker run -it --rm -e "DEBUG=*,-puppeteer:*" \
    redcode/goose-parser:chrome-1.1.3-parser-0.6.0 \
    https://www.google.com/search?q=goose-parser \
    '{
        "actions": [
            {
                "type": "wait",
                "scope": ".g"
            }
        ],
        "rules": {
            "scope": ".g",
            "collection": [
                [
                    {
                        "scope": ".r>a h3",
                        "name": "name"
                    },
                    {
                        "scope": ".r>a:eq(0)",
                        "name": "link",
                        "attr": "href"
                    }
                ]
            ]
        }
    }'
Goose is a fancy web scraping framework that started as an open source tool in a single repo. Later, Goose’s parts were moved into separate repositories to keep the core library small and to allow running scrapers in multiple environments. It has many useful features, and it can be run as a Docker container, which significantly simplifies the development of new scrapers.
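That split is what makes the environment pluggable. As a sketch (assuming the goose-phantom-environment package, whose constructor mirrors the Chrome one used above), switching the browser backend comes down to a different require:

const Parser = require('goose-parser');
// Assumption: goose-phantom-environment exposes the same constructor
// shape as goose-chrome-environment in the earlier example.
const PhantomEnvironment = require('goose-phantom-environment');

const parser = new Parser({
    environment: new PhantomEnvironment({
        url: 'https://www.google.com/search?q=goose-parser',
    }),
});
// parser.parse({ actions, rules }) then works exactly as before.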
Interested? Then head over to goose.show to find more details! It’s free and open source! If you have any questions or feedback, feel free to reach out.
Thanks for your time!