Distillery

200 proof web scraping

Overview

Build simple and maintainable web scrapers through configuration over code. Create one configuration file that describes how to make a single HTTP request and how to distill the returned HTML into easily consumable objects.

Installation

npm install distillery-js --save

API

Distillery(still, options)

distillery.distill(parameters, returnResponse)

distillery.parse(html)

Example
// still.js
module.exports = function(distillery) {
  return {
    process: {
      ...
    },
    models: [
      {
        name: 'postings',
        type: 'collection',
        elements: {
          title: 'td.title',
          id: 'td.postingID > small'
        },
        iterate: 'html > div',
      },
      {
        name: 'page',
        type: 'item',
        elements: {
          current: '.current_page',
          last: '.last_page'
        }
      }
    ]
  }
};

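// test.js
var Distillery = require('distillery-js')
var still = require('./still.js')
var fs = require('fs')

// Assumption: the markup to parse was saved beforehand; page.html is a placeholder name
var html = fs.readFileSync('./page.html', 'utf8')

console.log(Distillery(still).parse(html))

$ node test.js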
[
    {
        current: 1,
        last: 10
    },
    [
        {
            id: 100,
            title: 'A post title'
        },
        {
            id: 101,
            title: 'Another post title'
        },
        {
            id: 102,
            title: 'One more post title'
        }
    ]
]

distillery.expect.http_code(code)

The HTTP code to expect the response to contain.

distillery.expect.url(url)

distillery.expect.html_element(path, expected)

Ignite CLI

Overview

The purpose of the Ignite command line interface is to make testing stills quicker and simpler. Ignite allows stills to be run from the command line with detailed output to assist developers.

Installation

npm install distillery-js -g

Usage

$ distillery ignite <stillPath> [-o <option>...] [-p <parameter>...]

stillPath

The relative path to the still file

option

Set distillery CLI options which are detailed below. All model options have default values shown while process options have no defaults.

parameter

Set any parameters defined in the still's process.request.query, process.request.headers, or process.request.payload section that would normally be set in the .distill method of the programmatic API. Parameters should follow the format -p <name>=<value>.

Example
$ distillery ignite ./still.js -p id=4809527167
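
To illustrate the -p <name>=<value> format, here is a minimal sketch of how such pairs could be turned into the parameters object that .distill() would otherwise receive. parseParameters is a hypothetical helper written for this illustration, not part of distillery, and the CLI's actual parsing may differ.

```javascript
// Hypothetical helper (not part of distillery): converts "name=value" strings
// into a parameters object like the one passed to .distill().
function parseParameters(pairs) {
  var parameters = {};
  pairs.forEach(function(pair) {
    var index = pair.indexOf('=');
    if (index < 1) {
      throw new Error('Expected <name>=<value>, got: ' + pair);
    }
    // Split on the first "=" only so values may themselves contain "="
    parameters[pair.slice(0, index)] = pair.slice(index + 1);
  });
  return parameters;
}

console.log(parseParameters(['id=4809527167']));
```

Note that every value arrives as a string in this sketch, so a still that needs numbers would have to convert them, for example in a format function.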

Stills

A still is the configuration object that defines how to retrieve and extract data from a web page. It is written as a function that returns an object with the structure shown here:

module.exports = function(distillery) {
  return {
    process: {
      request: {
        ...
      },
      response: {
        ...
      }
    },
    models: [
      ...
    ]
  }
};

process

(object, required) - The process object contains the definition for how to make the HTTP request and how to handle the response.

process.request

(object, required) - Data detailing how to complete an HTTP request.

process.request.url

(string, required) - URL string which may contain tokenized variables. See the Tokens section below for more information.

Tokens

Tokens are placeholders for variables in the process.request.url that can be passed into a still to modify the request. Tokens are enclosed by { and } in the url string. This allows the same still to be reused for multiple similar requests.

Example

In this example a GitHub username is passed into the distillery instance to save the profile page of a user. The username can be modified in the example.js script to scrape any user's profile.

// still.js
module.exports = function(distillery) {
  return {
    process: {
      request: {
        method: 'GET',
        url: 'https://github.com/{un}',
        query: {
          username: { name: 'un', required: true }
        }
      }
    }
  }
};

// example.js
var Distillery = require('distillery-js')
var still = require('./still.js')
var fs = require('fs')

Distillery(still)
  .distill({ username: 'achannarasappa' })
  .then(function(html) {
    fs.writeFileSync('user.html', html)
  })

Run with $ node ./example.js. The username value passed to .distill() in example.js can be changed to scrape the profile page of any GitHub user.
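
The substitution step itself can be pictured with a short sketch. The interpolate function below is hypothetical and written only for illustration; it is not distillery's internal implementation, and encoding the value with encodeURIComponent is an assumption made here.

```javascript
// Illustrative only: a hypothetical interpolate() that fills {token}
// placeholders in a URL. This is not distillery's internal implementation.
function interpolate(url, tokens) {
  return url.replace(/\{(\w+)\}/g, function(match, name) {
    if (!Object.prototype.hasOwnProperty.call(tokens, name)) {
      throw new Error('No value supplied for token: ' + name);
    }
    // Assumption: encode the value so it is safe to place in a URL
    return encodeURIComponent(tokens[name]);
  });
}

console.log(interpolate('https://github.com/{un}', { un: 'achannarasappa' }));
```

In the still above, the key username is what callers pass to .distill(), while un is the token name that appears in the url string.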

process.request.method

(string, required) - The HTTP verb which may be GET, POST, PUT, or DELETE

process.request.query

(object, optional) - Parameters to be interpolated with url tokens. Object key refers to the name that can be used to set the parameter in the Distillery(still).distill() method.

process.request.query[<key>].name

(string, required) - Name of the token in the url string.

process.request.query[<key>].required

(boolean, optional) - If set to true, then an error will be thrown if this variable is not set in the Distillery(still).distill() method.

process.request.query[<key>].default

(string, optional) - Default value for the given parameter.

process.request.query[<key>].validate

(function, optional) - A DistilleryValidationError is thrown when a falsy value is returned from this function.

Example
context: {
  name: 'context',
  'default': 'user',
  validate: function(value) {
    return value === 'user' || value === 'admin';
  }
},
process.request.query[<key>].format

(function, optional) - Modifies the parameter before running any validation.

Example
context: {
  name: 'context',
  'default': 'user',
  format: function(value) {
    return value * 10;
  }
},
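
How required, default, format, and validate might fit together can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: resolveParameter and ValidationError are hypothetical names, the lowercase format function is made up for the example, and the only ordering taken from the text above is that format runs before validate.

```javascript
// Hypothetical stand-in for the library's DistilleryValidationError
function ValidationError(message) {
  this.name = 'ValidationError';
  this.message = message;
}
ValidationError.prototype = Object.create(Error.prototype);

// Resolve one parameter from its definition and the value given to .distill().
// Assumed order: default, required check, format, validate.
function resolveParameter(key, definition, value) {
  if (typeof value === 'undefined') value = definition['default'];
  if (typeof value === 'undefined') {
    if (definition.required) throw new Error('Missing required parameter: ' + key);
    return undefined;
  }
  if (definition.format) value = definition.format(value);
  if (definition.validate && !definition.validate(value)) {
    throw new ValidationError('Invalid value for parameter: ' + key);
  }
  return value;
}

var context = {
  name: 'context',
  'default': 'user',
  // Made-up format function for illustration
  format: function(value) { return String(value).toLowerCase(); },
  validate: function(value) { return value === 'user' || value === 'admin'; }
};

// Formatting runs first, so 'ADMIN' becomes 'admin' and passes validation
console.log(resolveParameter('context', context, 'ADMIN'));
// No value supplied, so the default is used
console.log(resolveParameter('context', context));
```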

process.request.headers

(object, optional) - Parameters to be sent as headers. See Request documentation of headers for more information. Object key refers to the name that can be used to set the parameter in the Distillery(still).distill() method.

process.request.headers[<key>].name

(string, required) - Name of the http header.

process.request.headers[<key>].required

See process.request.query[<key>].required.

process.request.headers[<key>].default

See process.request.query[<key>].default.

process.request.payload

(object, optional) - Send request data as application/x-www-form-urlencoded. See Request documentation on forms for more information. Object key refers to the name that can be used to set the parameter in the Distillery(still).distill() method.

process.request.payload[<key>].name

(string, required) - Name of the form parameter.

process.request.payload[<key>].required

See process.request.query[<key>].required.

process.request.payload[<key>].default

See process.request.query[<key>].default.
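
As a rough picture of what sending data as application/x-www-form-urlencoded involves, the sketch below maps .distill() parameters onto form field names and encodes them with Node's built-in URLSearchParams. The encodePayload helper and the searchTerm and page keys are hypothetical; distillery delegates the actual encoding to its HTTP client, so treat this only as an illustration of the key-to-name mapping.

```javascript
// Hypothetical sketch: map .distill() parameters onto form field names and
// encode them as application/x-www-form-urlencoded.
function encodePayload(payload, parameters) {
  var form = new URLSearchParams();
  Object.keys(payload).forEach(function(key) {
    var definition = payload[key];
    var value = typeof parameters[key] !== 'undefined' ? parameters[key] : definition['default'];
    if (typeof value === 'undefined') {
      if (definition.required) throw new Error('Missing required parameter: ' + key);
      return; // optional and unset: leave the field out
    }
    form.append(definition.name, value);
  });
  return form.toString();
}

// "searchTerm" and "page" are made-up keys for illustration
var payload = {
  searchTerm: { name: 'q', required: true },
  page: { name: 'p', 'default': '1' }
};

console.log(encodePayload(payload, { searchTerm: 'web scraping' }));
```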

process.response

Hooks, validators, and indicators to handle an HTTP response. Each sub-object is a potential response condition. This allows a 404 Not Found response to be handled differently from a 200 OK response.

process.response[<key>].indicators

(object, required) - Lists the conditions that indicate this specific response was returned from the server. Each sub-key is a user-assigned name for the indicator. The set of available indicators is detailed below. Which combination or combinations of indicators constitute a specific response is decided in the process.response[<key>].validate function.

Indicators

Indicators are used to recognize whether a particular response was returned. They are key-value pairs where the key is the name of the indicator and the value is an expected value in the HTTP response. Distillery can detect three signatures: distillery.expect.http_code(code), distillery.expect.url(url), and distillery.expect.html_element(path, expected).

Example

The first two indicators are included with distillery. The last is a user-defined indicator, a function that should return true or false.

...
  indicators: {
    success_url: distillery.expect.url('https://github.com/'),
    success_code: distillery.expect.http_code(200),
    success_custom: function(response) {
        return (response.statusMessage === 'Not found')
    }
  },
...

process.response[<key>].validate

(function, required) - Used to determine which response was returned from the request. The result of each indicator can be used in this function to evaluate whether the response is valid. If multiple responses have validation functions that evaluate to true, the first response will be chosen.

Example

In the profile_success response, there are two indicators that will be evaluated, success_url and success_code. If the validate function returns true, the hook for this response will be triggered.

// still.js
module.exports = function(distillery) {
  return {
    process: {
      request: {
        method: 'GET',
        url: 'https://github.com/{un}',
        query: {
          username: { name: 'un', required: true }
        }
      },
      response: {
        profile_success: {
          indicators: {
            success_url: distillery.expect.url('https://github.com/'),
            success_code: distillery.expect.http_code(200)
          },
          validate: function(indicators) {
            return (indicators.success_code && indicators.success_url);
          },
          hook: function(response) {
            console.log('Successfully retrieved the user\'s profile!');
          }
        }
      }
    }
  }
};
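
The "first matching response wins" rule can be sketched without the library. Everything below is illustrative: selectResponse, expectCode, and expectUrl are hypothetical stand-ins, and the assumptions that indicators are functions of a response object and that the response exposes statusCode and url are made for this example only.

```javascript
// Illustrative sketch, not distillery's implementation: evaluate each
// response's indicators, then pick the first response whose validate()
// returns a truthy value.
function selectResponse(responses, response) {
  var keys = Object.keys(responses);
  for (var i = 0; i < keys.length; i++) {
    var definition = responses[keys[i]];
    var results = {};
    Object.keys(definition.indicators).forEach(function(name) {
      // Assumption: each indicator is a function of the HTTP response
      results[name] = Boolean(definition.indicators[name](response));
    });
    if (definition.validate(results)) return keys[i];
  }
  return null; // no response matched
}

// Hand-written stand-ins for distillery.expect.http_code and distillery.expect.url
function expectCode(code) {
  return function(response) { return response.statusCode === code; };
}
function expectUrl(url) {
  return function(response) { return response.url.indexOf(url) === 0; };
}

var responses = {
  profile_success: {
    indicators: { success_url: expectUrl('https://github.com/'), success_code: expectCode(200) },
    validate: function(indicators) { return indicators.success_code && indicators.success_url; }
  },
  profile_missing: {
    indicators: { missing_code: expectCode(404) },
    validate: function(indicators) { return indicators.missing_code; }
  }
};

console.log(selectResponse(responses, { statusCode: 404, url: 'https://github.com/nobody' }));
```

Because responses are checked in definition order, more specific responses are best listed before more general ones in this sketch.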

process.response[<key>].hook

(function, optional) - Hook function to trigger after the response is validated. Only a single hook will be triggered in a still. If no hook is defined,

models

(array, optional) - Array of models that define how to extract an entity from HTML

models[<index>]

(object, optional) - Model definition for how to extract an entity from HTML

models[<index>].name

(string, required) - Name for the model

models[<index>].type

(string, required) - Model type, which can be item or collection.

models[<index>].elements

(object, required) - Object containing all of the properties of the entities, with each key being the name of a property and each value being either a string CSS path to the HTML element or an object containing information about how to retrieve the property.

models[<index>].elements[<key>]

(string, object, or function, required) - There are several possibilities for what is returned from a model depending on the value of this key detailed below:

Example

Here, distillery will run the function in elements.title and return the result rather than selecting a specific CSS path as it does with elements.id.

models: [
    {
        type: 'item',
        elements: {
            id: 'div.id',
            title: function($) {
                return $('div#post-list > div').eq(0).html()
            }
        }
    }
]
models[<index>].elements[<key>].path

(string, required) - CSS path to an HTML element. Must also have an attr or regex property.

models[<index>].elements[<key>].attr

(string, optional) - Name of an HTML attribute to retrieve the value from

models[<index>].elements[<key>].regex

(string, optional) - Regular expression to test the inner text of the element at path

models[<index>].iterate

(string, required) - Required if type is collection, otherwise not needed. The CSS path that contains the items in a collection.

Example

With this snippet of a model, distillery will iterate over every table > tr in the HTML document and return the values from td.title as an array.

models: [
    {
        type: 'collection',
        elements: {
          title: 'td.title'
        },
        iterate: 'table > tr'
    }
]

models[<index>].validate

(function, optional) - Validation function that should return true for entities that should be returned and false for entities that should be removed from the result array for collections.

models[<index>].format

(function, optional) - Formatting function that will be run over each entity in a collection to transform its values or on a single entity for an item.

validate: function(posting) {
  return (typeof posting.title !== "undefined")
},
format: function(posting) {
  posting.title = posting.title.trim();
  return posting;
}
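
The effect of these two functions on extracted entities can be sketched as a filter followed by a map. applyModel is a hypothetical helper, not part of distillery, and running validation before formatting is an assumption made for this illustration.

```javascript
// Illustrative sketch: apply a model's validate and format functions to
// extracted entities. Assumes validation runs before formatting.
function applyModel(model, entities) {
  var validate = model.validate || function() { return true; };
  var format = model.format || function(entity) { return entity; };
  if (model.type === 'collection') {
    // Drop entities that fail validation, then format the rest
    return entities.filter(validate).map(format);
  }
  // For an item, entities is a single object and only format applies
  return format(entities);
}

var model = {
  type: 'collection',
  validate: function(posting) {
    return (typeof posting.title !== 'undefined');
  },
  format: function(posting) {
    posting.title = posting.title.trim();
    return posting;
  }
};

console.log(applyModel(model, [{ title: '  A post title ' }, { id: 101 }]));
```

With validation first, the entity without a title is removed before format would call trim on an undefined value.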

Examples

See distillery-examples