Resources

A Resource is a Django abstract model designed to easily connect to a data source. Currently such data sources can be:

  • APIs

  • Shell commands

  • Websites

The Resource makes connecting to data sources easier because:

  • It leverages common patterns in retrieving data, which means you can write less code

  • Once retrieved it will store data and return that data on subsequent uses

  • Resources use the Datagrowth configurations which makes them re-usable in different contexts

Getting started

To get started with gathering data through resources you need to pick a Resource class based on the data source you want to connect to. You can choose between HttpResource (for API connections and scraping websites) or ShellResource (for getting data from local commands). You’ll at least need to declare some class attributes on your own Resource to gather data from a source. Before this gets explained in detail we’ll demonstrate how resources are used in general.

The Resource model is the base class for all Datagrowth resources. What follows is pseudo code that demonstrates how classes derived from Resource are used to gather data. The shared Resource API uses the extract method to kick off data collection.

resource = Resource(  # abstract class, instantiate derived class in real code
    config=config  # resources take a Datagrowth config to handle context
)
# the call below kicks-off data collection through the resource
# for HttpResource use extract("get", ...) or extract("post", ...)
# for ShellResource use extract(...)
# input for the data collection can consist of args and kwargs
resource.extract(
    "some", "input",
    session=session
)

if not resource.success:
    # when things go wrong you can inspect status and response content
    print(resource.status, resource.content)
resource.close()  # cleans and saves the resource to cache the collected data

content_type, data = resource.content
if data is not None:
    pass  # handle the data ...

# When making the exact same extract call again.
# This time the data will come from the database as it has been stored before.
resource.extract(
    "some", "input",
    session=session
)

Retrieving the data from the database instead of the actual source is very convenient when dealing with large data sources. It allows for retries without starting over, keeps resource use low and makes subsequent runs much faster.
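The caching idea can be illustrated without Django. The sketch below is not how Datagrowth implements it; CachedFetcher, its store dictionary and the calls counter are hypothetical names, and a plain dict stands in for the database table. It only shows the principle that an identical extract call is answered from storage instead of the source.

```python
import json


class CachedFetcher:
    """Hypothetical stand-in for a resource: caches results per unique input."""

    def __init__(self, fetch):
        self.fetch = fetch  # callable that performs the expensive retrieval
        self.store = {}     # stands in for the database table
        self.calls = 0      # counts how often the real source was used

    def extract(self, *args, **kwargs):
        # Serialize the input deterministically so equal calls share one key
        key = json.dumps([args, kwargs], sort_keys=True)
        if key not in self.store:
            self.calls += 1
            self.store[key] = self.fetch(*args, **kwargs)
        return self.store[key]


fetcher = CachedFetcher(lambda *args, **kwargs: " ".join(args).upper())
first = fetcher.extract("some", "input")   # retrieves from the "source"
second = fetcher.extract("some", "input")  # served from the store
print(first, second, fetcher.calls)
```

The second call never reaches the fetch callable, which is the behaviour that makes retries and repeated runs cheap.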

Http Resource

The HttpResource retrieves data from any HTTP source. Typically these sources are APIs or websites.

Basic usage

The most basic usage for fetching data from an HTTP source is inheriting from URLResource, which in turn inherits from HttpResource.

from collections import Counter

from datagrowth.resources import URLResource


class MyHTTPDataSource(URLResource):
    pass


data_source = MyHTTPDataSource()
data_source.get("https://example.com")
# data_source now contains the response from example.com

# The URLResource is nothing but a thin Django wrapper around the requests library
# You can check if the request succeeded and get the data.
# It will return Python objects for JSON responses or BeautifulSoup instances for HTML and XML
if data_source.success:
    content_type, data = data_source.content

# Resource objects are actually Django models which can be closed to save them to the database
data_source.close()
# Using the Django ORM it is easy to query how requests fared, for example by counting non-200 statuses
statuses = Counter(
    MyHTTPDataSource.objects.exclude(status=200).values_list("status", flat=True)
)
# And as explained above the database has other advantages like storing data already retrieved.
# The below does not make a request, but fetches results from the database.
# It does this because above we saved a resource to the exact same request
data_source_cache = MyHTTPDataSource().extract("get", "https://example.com")

# Apart from GET you can also do POST.
# Any keyword arguments will be sent as the body of the POST request.
data_source_post = MyHTTPDataSource().extract("post", "https://example.com", example=True)

Downloading files

Usually data is available under the content property. It is also possible to save responses to disk in files. This can be convenient to save images. To do this you use the HttpImageResource:

from datagrowth.resources import HttpImageResource


class MyHTTPImageSource(HttpImageResource):
    pass

image_source = MyHTTPImageSource()
image_source.get("https://example.com/image.jpg")

# The content property will now return the image file instead of data
if image_source.success:
    content_type, image_file = image_source.content

It’s also possible to save other types of files. This can be done by using HttpFileResource instead of HttpImageResource.

Customize requests

In most cases it isn’t sufficient to simply pass URLs to URLResource or HttpImageResource. With these URL based resources you call get and post directly. For anything more involved you can customize how any HttpResource fetches data by setting some attributes:

from datagrowth.resources import HttpResource


class MyHTTPDataSource(HttpResource):

    URI_TEMPLATE = "https://example.com?query={}"


data_source = MyHTTPDataSource()
# The call below will make a request to https://example.com?query=my-query-terms
data_source.extract("get", "my-query-terms")
print(data_source.request)  # outputs the request being made

The URI_TEMPLATE is the most basic way to declare how resources should be fetched. A more complete example is below. The example is using extract("post", ...), but most attributes also work for extract("get", ...):

from datagrowth.resources import HttpResource


class MyHTTPDataSource(HttpResource):

    URI_TEMPLATE = "https://example.com"

    # Add query parameters to the end of URL's with PARAMETERS
    PARAMETERS = {
        "defaults": 1
    }

    # Or add headers with HEADERS
    HEADERS = {
        "Content-Type": "application/json"
    }

    # As this resource will now be using POST we'll add default data with DATA
    DATA = {
        "default_data": 1
    }

data_source = MyHTTPDataSource()
# The call below makes a POST request to https://example.com?defaults=1
# It will add a JSON content header
# and sends a dictionary with data containing the default_data and more_data keys.
data_source.extract("post", more_data="Yassss")
print(data_source.request)  # outputs the request being made

If you need more control over parameters, headers or data, then you can override the parameters, headers and data methods. These methods by default return the PARAMETERS, HEADERS and DATA attributes. The data method will also merge in any keyword arguments coming from the call to extract("post", ...) if applicable.
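How class level defaults and call specific input could combine is sketched below in plain Python. RequestSketch and build_url are hypothetical names that only mimic the attributes described above; the real HttpResource methods may take different arguments and do more work.

```python
from urllib.parse import urlencode


class RequestSketch:
    """Hypothetical stand-in that only mimics the attribute and method names."""

    URI_TEMPLATE = "https://example.com"
    PARAMETERS = {"defaults": 1}
    DATA = {"default_data": 1}

    def parameters(self, **kwargs):
        # Copy so that callers can't mutate the class attribute
        return dict(self.PARAMETERS)

    def data(self, **kwargs):
        # Defaults first, then the keyword arguments of the call
        merged = dict(self.DATA)
        merged.update(kwargs)
        return merged

    def build_url(self, *args):
        url = self.URI_TEMPLATE.format(*args)
        query = urlencode(self.parameters())
        if not query:
            return url
        separator = "&" if "?" in url else "?"
        return url + separator + query


sketch = RequestSketch()
url = sketch.build_url()
body = sketch.data(more_data="Yassss")
print(url, body)
```

Overriding parameters or data in a subclass then changes the request without touching the code that builds the URL.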

Continuation requests

Usually a response also contains some information on how to get more data from the same source. The HttpResource provides a mechanism to easily follow up on requests made by the resource. You’ll have to override the next_parameters method to indicate which data to use for continuation requests.

from datagrowth.resources import HttpResource


class MyHTTPDataSource(HttpResource):

    URI_TEMPLATE = "https://example.com?query={}"

    def next_parameters(self):
        """
        This method looks if there is a "next" key in the response data.
        If there is none it simply returns an empty dict.
        If there is one it returns the value under a "page" key.
        """
        params = super().next_parameters()
        content_type, data = self.content
        if data is None:
            return params
        page = data.get("next", None)
        if page is None:
            return params
        params["page"] = page
        return params


data_source = MyHTTPDataSource()
# The call below will make a request to https://example.com?query=my-query-terms
data_source.extract("get", "my-query-terms")
follow_up = data_source.next()
# The call below will make a request to https://example.com?query=my-query-terms&page=1
# Provided that the response data contains a "next" key with value 1
follow_up.extract("get")
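Following continuation requests until the source runs out usually comes down to a loop. The sketch below shows that loop in plain Python with canned responses; follow_pages and FAKE_RESPONSES are hypothetical, and it assumes, as in the example above, that a response without a "next" key means there is nothing more to fetch.

```python
def follow_pages(fetch_page, first_page=None):
    """Yield response data until a response carries no "next" key."""
    page = first_page
    while True:
        data = fetch_page(page)
        yield data
        page = data.get("next")
        if page is None:
            break


# Canned responses keyed by page, standing in for a real HTTP source
FAKE_RESPONSES = {
    None: {"results": ["a", "b"], "next": 1},
    1: {"results": ["c"], "next": 2},
    2: {"results": ["d"]},
}

results = [
    item
    for data in follow_pages(FAKE_RESPONSES.get)
    for item in data["results"]
]
print(results)
```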

Authenticating requests

Authenticating requests is very similar to other customization of an HttpResource. You need to override the auth_headers or auth_parameters methods. The dictionaries with headers and/or parameters returned by these methods get added to the request, but only when a request is made. This sensitive information does not get stored in the database. Inside the methods it’s possible to use, for instance, the config to provide credentials.

Warning

Beware that non-default values for config get stored in plain text in the database. So credentials shouldn’t get passed to a config directly. Use register_defaults instead (see: register defaults example).

from datagrowth.resources import HttpResource


class MyHTTPDataSource(HttpResource):

    def auth_headers(self):
        return {
            "Authorization": "Bearer {}".format(self.config.api_token)
        }

Shell Resource

The ShellResource retrieves data from a shell command. You’ll need to create a class that inherits from ShellResource and specify a few attributes to make a ShellResource run your commands and gather data.

Specify command

Any command that a ShellResource needs to run gets passed down to the subprocess module. That module accepts commands as a list of strings, where each element is a part of the command. This means that a grep command that finds the string “test” in files in the current directory looks as follows:

command = ["grep", "test", "."]
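The list form matters because each element reaches the program as exactly one argument, without any shell quoting. The sketch below shows this with the standard library only. It uses the Python interpreter as the command instead of grep, purely so that it also runs on systems without grep.

```python
import subprocess
import sys

# The last element contains a space, but still arrives as a single argument
command = [sys.executable, "-c", "import sys; print(sys.argv[1])", "two words"]
completed = subprocess.run(command, capture_output=True, text=True)

print(completed.returncode)      # exit code of the command
print(completed.stdout.strip())  # text the command wrote to stdout
```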

Let’s assume that we want to gather data by searching for certain strings in certain directories. Let’s further assume that the strings to search for and the directories to get data from can vary per context. A resource that is capable of gathering such data would look like this:

from datagrowth.resources import ShellResource


class MyGrepDataSource(ShellResource):

    CMD_TEMPLATE = ["grep", "{}", "{}"]


data_source = MyGrepDataSource()
data_source.extract("test", ".")
# data_source now contains lines of text where "test" is found in files in current directory.
# You can call the debug method to see which command was executed exactly.
data_source.debug()  # out: grep "test" .

# The ShellResource is nothing but a thin Django wrapper around the subprocess module
# You can check if the command succeeded and get the data.
# It will return the data as unicode text by default.
if data_source.success:
    content_type, data = data_source.content

# Resource objects are actually Django models which can be closed to save them to the database
data_source.close()

It’s also possible to specify flags. For flags without values, like grep's -R flag, you don’t need to do anything extra. Some other flags accept values, like grep's --context flag. These flags need to be specified in the FLAGS attribute. Furthermore you need to add the CMD_FLAGS element to your CMD_TEMPLATE to indicate where these flags with values should get inserted in the command. Once this is done you can specify the values for the flags through the keyword arguments of extract.

from datagrowth.resources import ShellResource


class MyGrepDataSource(ShellResource):

    CMD_TEMPLATE = [
        "grep",
        "-R",
        "CMD_FLAGS",  # CMD_FLAGS gets replaced by actual flags
        "{}",
        "{}"
    ]

    FLAGS = {  # keys correspond to the kwargs of extract and values to command flags
        "context": "--context="
    }

data_source = MyGrepDataSource()
data_source.extract("test", ".", context=5)
data_source.debug()  # out: grep -R --context=5 test .
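To make the substitution concrete, here is a rough sketch of how a template, positional input and flags could be combined into a command list. build_command is a hypothetical helper written for this illustration; it is not the Datagrowth implementation, which may differ in details such as validation and escaping.

```python
def build_command(template, flags, *args, **kwargs):
    """Combine a command template with positional input and flag values."""
    # Render only the flags that received a value through kwargs
    rendered_flags = [
        "{}{}".format(flag, kwargs[key])
        for key, flag in flags.items()
        if key in kwargs
    ]
    positional = iter(args)
    command = []
    for part in template:
        if part == "CMD_FLAGS":
            command.extend(rendered_flags)
        elif "{}" in part:
            command.append(part.format(next(positional)))
        else:
            command.append(part)
    return command


template = ["grep", "-R", "CMD_FLAGS", "{}", "{}"]
flags = {"context": "--context="}
with_flag = build_command(template, flags, "test", ".", context=5)
without_flag = build_command(template, flags, "test", ".")
print(with_flag, without_flag)
```

When no value is given for a flag the CMD_FLAGS placeholder simply disappears from the command.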

Cleaning output

It’s often best to pass through as much data as you can from a Resource. That makes the Resource easier to re-use in different contexts. However when dealing with shell commands the output can be much more than you desire and some cleanup is necessary.

There are two ways to do this:

  1. Override the clean_stdout and/or clean_stderr methods to clean data before storage

  2. Override the transform method to clean data after storage

Using the earlier example, cleaning the data could look like this:

from datagrowth.resources import ShellResource


class MyGrepDataSource(ShellResource):

    CMD_TEMPLATE = ["grep", "{}", "{}"]

    def clean_stdout(self, stdout):
        out = super().clean_stdout(stdout)
        return out.replace("\r", "\n")

    def clean_stderr(self, stderr):
        err = super().clean_stderr(stderr)
        return err.replace("\r", "\n")

    def transform(self, stdout):
        return stdout.replace("test", "TEST")


data_source = MyGrepDataSource()
data_source.extract("test", ".")
data_source.close()
print(data_source.stdout)  # out: stdout without \r but with "test" in lowercase
print(data_source.stderr)  # out: stderr without \r
content_type, data = data_source.content
print(data)  # out: stdout without \r and with "test" in uppercase

Working directory

The grep command is present globally on most systems. However, often you want to retrieve data from a command that is not available system-wide. Instead the binary of that command sits somewhere in a directory, where it got installed or compiled. To run such commands you could prefix the command with a full path, but that would make the ShellResource less portable. Alternatively you can set DIRECTORY_SETTING to "shell_resource_bin_dir". When specified the ShellResource will resolve this through Datagrowth configuration and use it as working directory. Set the Django setting DATAGROWTH_SHELL_RESOURCE_BIN_DIR to configure that value.

For example: setting DATAGROWTH_SHELL_RESOURCE_BIN_DIR to "/usr/local/bin" will run the command specified in the ShellResource from the Brew directory. On a Mac that would allow retrieving data from commands like wget or htop when installed through Brew.
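Since the ShellResource is described as a thin wrapper around subprocess, the effect of a working directory can be demonstrated with the standard library alone. The sketch below assumes nothing about Datagrowth internals; a temporary directory stands in for the configured bin directory and the Python interpreter stands in for the command.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as bin_dir:
    # cwd makes the child process start inside bin_dir
    completed = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd=bin_dir,
        capture_output=True,
        text=True,
    )
    reported = completed.stdout.strip()
    # Compare resolved paths, because temp directories are often symlinked
    same_directory = os.path.realpath(reported) == os.path.realpath(bin_dir)

print(same_directory)
```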

Environment

The exact behaviour of commands is often regulated through environment variables. You can specify these for a ShellResource by overriding the environment method. That method receives the input from the extract method and should return a dictionary with key-value pairs that will be used as environment variables or None when no variables should get set. If you only use static variables it’s possible to define those on the VARIABLES attribute. The default environment method returns VARIABLES.

from datagrowth.resources import ShellResource


class MyShellDataSource(ShellResource):

    CMD_TEMPLATE = ["command.sh", "{}"]

    def environment(self, *args, **kwargs):
        mode = kwargs.pop("mode", None)
        if not mode:
            return
        return {
            "COMMAND_MODE": mode
        }

data_source = MyShellDataSource()
# The call below will execute whatever is in "command.sh" with a COMMAND_MODE set to "foo"
data_source.extract("test", mode="foo")
data_source.debug()  # out: command.sh test
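What happens with the returned dictionary can be sketched with subprocess directly. In the sketch below the environment function mirrors the method above, but merging the result into a copy of os.environ is an assumption made for this example. Whether Datagrowth merges or replaces the environment is not stated here.

```python
import os
import subprocess
import sys


def environment(**kwargs):
    """Return extra environment variables or None, like the method above."""
    mode = kwargs.get("mode")
    if not mode:
        return None
    return {"COMMAND_MODE": mode}


extra = environment(mode="foo")
env = dict(os.environ)   # assumption: keep the existing variables
env.update(extra or {})  # None means no extra variables

completed = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['COMMAND_MODE'])"],
    env=env,
    capture_output=True,
    text=True,
)
print(completed.stdout.strip())
```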

Configuration

You can adjust how a Resource retrieves data by using some configuration options. See the configuration section to learn more on how to set configuration defaults. Here we’ll explain the available configurations by passing them directly to a Resource. For application wide defaults, use register_defaults.

Caching behaviour

An important aspect of Resource is that it will act as a cache if retrieving data was successful. There are a few configuration options that modify the cache behaviour. All examples below use a namespace of “global”.

from example import MyResource

# This configuration disables all cache.
# It still stores the Resource, but it will never get used twice.
MyResource(config={
    "purge_immediately": True
})

# For more fine grained control the purge_after configuration can be used
MyResource(config={
    "purge_after": {
        "days": 30
    }
})
# Such a configuration will indicate to Datagrowth that the Resource
# should not be used as cache after 30 days.
# The value of purge_after can be any dict that gets accepted as kwargs to Python's timedelta.
# This makes it possible to be very flexible about when a Resource
# should not get used anymore, but it won't delete any Resources.
# Datagrowth just doesn't use them as cache after the specified time.

# Sometimes getting data from a Resource is very computation intensive.
# In such cases it might be a good idea to never actually retrieve data
# unless it is cached by a background process.
# By using the cache_only configuration you can force a Resource
# to only return if there is a cached result and to never start real data retrieval.
resource = MyResource(config={
    "cache_only": True
})
resource.extract()  # this never makes a real request
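Because purge_after holds keyword arguments for Python's timedelta, an expiry check can be expressed in a few lines. The sketch below is only an illustration of that idea; is_usable_as_cache and its arguments are hypothetical and the actual check inside Datagrowth may look different.

```python
from datetime import datetime, timedelta


def is_usable_as_cache(created_at, purge_after, now):
    """True while the stored result is younger than the purge_after period."""
    return now - created_at < timedelta(**purge_after)


created_at = datetime(2024, 1, 1)
fresh = is_usable_as_cache(created_at, {"days": 30}, datetime(2024, 1, 20))
stale = is_usable_as_cache(created_at, {"days": 30}, datetime(2024, 2, 15))
# Any timedelta keyword works, for instance hours and minutes combined
short = is_usable_as_cache(
    created_at, {"hours": 1, "minutes": 30}, datetime(2024, 1, 1, 1, 0)
)
print(fresh, stale, short)
```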

User Agent configuration

This configuration is only useful for HttpResource and child classes. It uses the http_resource namespace.

from datagrowth.configuration import create_config
from example import MyResource

http_config = create_config("http_resource", {
    "user_agent": "My custom crawler User Agent"
})

# This configuration sets the user agent for any request made by the Resource.
MyResource(config=http_config)

Backoff Delays configuration

This configuration is only useful for HttpResource and child classes. An HttpResource will sleep for some seconds when a 420, 429, 502, 503 or 504 HTTP error occurs. By default these sleep intervals, which give the responding server some rest, last 2, 4, 8 and finally 16 seconds. After the final backoff delay interval the HttpResource will error and give up making the request if the server never responds. You can disable or modify this behaviour by setting the backoff_delays configuration. It uses the http_resource namespace.

from datagrowth.configuration import create_config
from example import MyResource

slow_retry_config = create_config("http_resource", {
    "backoff_delays": [60, 120]
})

# This configuration will let the HttpResource wait 1m and then 2m instead of the default amount of seconds.
minutes_backoff_delay = MyResource(config=slow_retry_config)

no_retry_config = create_config("http_resource", {
    "backoff_delays": []
})

# You can also disable the backoff delay procedure.
no_backoff_delays = MyResource(config=no_retry_config)
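The retry behaviour described above follows a common pattern that can be sketched in plain Python. request_with_backoff, send and the injectable sleep argument are hypothetical names for this illustration; the sketch assumes one retry per delay and is not the actual HttpResource code.

```python
import time

RETRY_STATUSES = {420, 429, 502, 503, 504}


def request_with_backoff(send, backoff_delays=(2, 4, 8, 16), sleep=time.sleep):
    """Call send() and retry after each delay while the status asks for it."""
    status = send()
    for delay in backoff_delays:
        if status not in RETRY_STATUSES:
            break
        sleep(delay)
        status = send()
    return status


# Canned statuses and a recording sleep keep the example instant
responses = iter([503, 429, 200])
slept = []
status = request_with_backoff(lambda: next(responses), sleep=slept.append)

# An empty list of delays disables retrying altogether
no_retry_status = request_with_backoff(lambda: 503, backoff_delays=[])
print(status, slept, no_retry_status)
```

Passing the sleep function as an argument is a design choice for this sketch, because it allows testing the retry logic without actually waiting.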