Resources

Resource (base class)

class datagrowth.resources.base.Resource(*args, **kwargs)

This class defines the interface that all resources adhere to. You’ll rarely extend this class directly. The HttpResource and ShellResource are examples of classes that extend this class.

clean()

Hook for doing any extra model-wide validation after clean() has been called on every field by self.clean_fields. Any ValidationError raised by this method will not be associated with a particular field; it will have a special-case association with the field defined by NON_FIELD_ERRORS.

close()

This convenience method handles both the clean and the save step when saving models. Making use of the resource cache requires cleaning before saving, and close takes care of both steps in the right order.

property content

This property typically gets overridden for different resource types. It should return the content_type and data from the resource.

Returns:

content_type, data

classmethod get_name()

Return the name of the resource. This is the model_name for almost all resources.

Returns:

(str) lowercase model name

classmethod get_queue_name()

Returns the queue name that background tasks should dispatch to. By default it returns the default Django Celery queue name.

Returns:

(str) queue name

handle_errors()

Override this method to handle resource-specific error cases. Usually you’d raise a particular DGResourceException to indicate particular errors.

next()

Creates a new Resource that is the follow-up of the current Resource, like the Resource for a next page in a Resource that supports pagination. Or returns None if no such follow-up exists (the default).
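A minimal sketch of how calling code might walk such follow-ups, assuming only that next() returns either a new object or None. The FakePage class and the collect_pages helper are hypothetical stand-ins for illustration and are not part of datagrowth.

```python
class FakePage:
    """Hypothetical stand-in for a paginated resource (not a datagrowth class)."""

    def __init__(self, number, last):
        self.number = number
        self.last = last

    def next(self):
        # Return the follow-up page, or None when there is nothing left.
        if self.number >= self.last:
            return None
        return FakePage(self.number + 1, self.last)


def collect_pages(first):
    """Follow next() until it returns None and collect every page."""
    pages = []
    current = first
    while current is not None:
        pages.append(current)
        current = current.next()
    return pages


page_numbers = [page.number for page in collect_pages(FakePage(1, 3))]
```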

retain(retainer)

Links any Django model onto a GenericRelation on a resource. Any resources retained this way will not get deleted from cache. This is convenient to save any context that can help during debugging.

Parameters:

retainer – (model) the model retaining the resource

property success

This property typically gets overridden for different resource types. It should indicate the success of the data gathering.

Returns:

(bool)

variables(*args)

Maps the input arguments from a resource to a dictionary. This makes it easy to access the positional input variables under names. Override this method to create the mapping for your particular resource.

Parameters:

args – (tuple) the positional arguments given as input to the resource

Returns:

(dict) a dictionary with the input variables as values
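As an illustration, an override could pair the positional inputs with names. The sketch below is a standalone function rather than a real Resource method, and the names "query" and "language" are assumptions chosen for the example.

```python
def variables_sketch(*args):
    """Map positional inputs to names, as an override of variables might do."""
    names = ("query", "language")  # assumed input names, for illustration only
    return dict(zip(names, args))


named = variables_sketch("django", "en")
```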

Http

class datagrowth.resources.http.generic.HttpResource(*args, **kwargs)

You can extend from this base class to declare a Resource that gathers data from an HTTP(S) source, for instance websites and (REST) APIs.

This class is a wrapper around the requests library and provides:

  • easy follow-up of continuation URLs in responses

  • cached responses when retrieving data a second time

  • authentication through headers or GET parameters without storing credentials

  • slowing down of requests when servers give errors or warnings related to high load

Response headers, body and status get stored in the database as well as an abstraction of the request. Any authentication data gets stripped before storage in the database. Override the handle_errors method to customize how errors in responses are detected.

auth_headers()

Returns the dictionary that should be used as authentication headers for the request the resource will make. Override this method in your own class to add authentication. By default this method returns an empty dictionary meaning there are no authentication headers.

Returns:

(dict) dictionary with headers to add to requests

auth_parameters()

Returns the dictionary that should be used as authentication parameters for the request the resource will make. Override this method in your own class to add authentication. By default this method returns an empty dictionary meaning there are no authentication parameters.

Returns:

(dict) dictionary with parameters to add to requests

clean()

Hook for doing any extra model-wide validation after clean() has been called on every field by self.clean_fields. Any ValidationError raised by this method will not be associated with a particular field; it will have a special-case association with the field defined by NON_FIELD_ERRORS.

property content

After a successful get or post call this method reads the ContentType header from the HTTP response. Depending on the MIME type it will return the content type and the parsed data.

  • For a ContentType of application/json data will be a python structure

  • For a ContentType of text/html or text/xml data will be a BeautifulSoup instance

Any other ContentType will result in None. You are encouraged to extend HttpResource to handle your own data types.

Returns:

content_type, data

create_next_request()

Creates and returns a dictionary that represents a continuation request. Often a source will indicate how to continue gathering more data. By overriding the next_parameters method developers can indicate how continuation requests can be made. Calling this method will build a new request using these parameters.

Returns:

(dict) a dictionary representing a continuation request to be made

data(**kwargs)

Returns the dictionary that will be used as HTTP body for the request the resource will make. By default this is the dictionary from the DATA attribute updated with the kwargs from the input from the send method.

Parameters:

kwargs – keyword arguments from the input

Returns:

get(*args, **kwargs)

This method calls send with “get” as a method. See the send method for more information.

Parameters:
  • args – arguments that will get merged into the URI_TEMPLATE

  • kwargs – keyword arguments that will get sent as data

Returns:

HttpResource

handle_errors()

Raises exceptions upon error statuses. Override this method to raise exceptions for your own error states. By default it raises the DGHttpError40X and DGHttpError50X exceptions for statuses.
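The idea of mapping status ranges to exceptions can be sketched without the library. The exception classes below are hypothetical stand-ins and not the DGHttpError40X and DGHttpError50X classes themselves; the status ranges are an assumption based on the usual HTTP conventions.

```python
class ClientErrorSketch(Exception):
    """Hypothetical stand-in for an error raised on 4xx statuses."""


class ServerErrorSketch(Exception):
    """Hypothetical stand-in for an error raised on 5xx statuses."""


def error_for_status(status):
    """Return the exception class that fits a status, or None for no error."""
    if 400 <= status < 500:
        return ClientErrorSketch
    if 500 <= status < 600:
        return ServerErrorSketch
    return None


def handle_errors_sketch(status):
    """Raise the matching exception, mirroring what an override might do."""
    error_class = error_for_status(status)
    if error_class is not None:
        raise error_class("request failed with status {}".format(status))
```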

static hash_from_data(data)

Given a dictionary, this method will recursively sort and JSON dump the keys and values of that dictionary. The end result is given to SHA-1 to create a hash that is unique for that data. This hash can be used for a database lookup to find earlier requests that sent the same data.

Parameters:

data – (dict) a dictionary of the data to be hashed

Returns:

the hash of the data
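A rough standalone sketch of this technique, using only the standard library. It relies on json.dumps with sort_keys=True to order nested dictionary keys and does not reorder lists, so the hashes it produces should not be expected to match those of the library.

```python
import hashlib
import json


def hash_from_data_sketch(data):
    """Hash a dictionary so that key order does not influence the result."""
    # sort_keys also applies to nested dictionaries.
    payload = json.dumps(data, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


first = hash_from_data_sketch({"a": 1, "b": {"c": 2, "d": 3}})
second = hash_from_data_sketch({"b": {"d": 3, "c": 2}, "a": 1})
```

Both calls receive the same data in a different key order, so first and second are equal.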

headers(*args, **kwargs)

Returns the dictionary that should be used as headers for the request the resource will make. By default this is the dictionary from the HEADERS attribute.

Parameters:
  • args – positional arguments from the input (ignored by default)

  • kwargs – keyword arguments from the input (ignored by default)

Returns:

(dict) a dictionary representing HTTP headers

next() → Self | None

Creates a new Resource that is the follow-up of the current Resource, like the Resource for a next page in a Resource that supports pagination. Or returns None if no such follow-up exists (the default).

next_parameters()

Returns the dictionary that should be used as HTTP query parameters for the continuation request a resource can make. By default this is an empty dictionary. Override this method and return the correct parameters based on the content of the resource.

Returns:

(dict) a dictionary representing HTTP continuation query parameters
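An override typically inspects the parsed response and turns a continuation marker into query parameters. The standalone sketch below assumes a JSON response with a hypothetical "next_page_token" field that has to be sent back as a hypothetical "page_token" parameter; real sources will use their own names.

```python
def next_parameters_sketch(data):
    """Derive continuation query parameters from parsed response data."""
    token = data.get("next_page_token")  # hypothetical field name
    if not token:
        # No marker present: an empty dictionary signals no continuation.
        return {}
    return {"page_token": token}  # hypothetical parameter name
```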

parameters(**kwargs)

Returns the dictionary that should be used as HTTP query parameters for the request the resource will make. By default this is the dictionary from the PARAMETERS attribute.

You may need to override this method. It will receive the return value of the variables method as kwargs.

Parameters:

kwargs – variables returned by the variables method (ignored by default)

Returns:

(dict) a dictionary representing HTTP query parameters

static parse_content_type(content_type, default_encoding='utf-8')

Given an HTTP ContentType header, this method will return the MIME type and the encoding. If no encoding is found the default encoding gets returned.

Parameters:
  • content_type – (str) the HTTP ContentType header

  • default_encoding – (str) the default encoding when no encoding is found in the header

Returns:

mime_type, encoding
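The parsing described above can be approximated with plain string handling. This is a sketch under the assumption that the encoding is given through a charset parameter; it is not the library's implementation.

```python
def parse_content_type_sketch(content_type, default_encoding="utf-8"):
    """Split a ContentType header value into a MIME type and an encoding."""
    mime_type, *parameters = [part.strip() for part in content_type.split(";")]
    encoding = default_encoding
    for parameter in parameters:
        key, _, value = parameter.partition("=")
        if key.strip().lower() == "charset" and value:
            encoding = value.strip().strip('"')
    return mime_type, encoding
```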

patch(*args, **kwargs)

This method calls send with “patch” as a method. See the send method for more information.

Parameters:
  • args – arguments that will get merged into the URI_TEMPLATE

  • kwargs – keyword arguments that will get sent as data

Returns:

HttpResource

post(*args, **kwargs)

This method calls send with “post” as a method. See the send method for more information.

Parameters:
  • args – arguments that will get merged into the URI_TEMPLATE

  • kwargs – keyword arguments that will get sent as data

Returns:

HttpResource

put(*args, **kwargs)

This method calls send with “put” as a method. See the send method for more information.

Parameters:
  • args – arguments that will get merged into the URI_TEMPLATE

  • kwargs – keyword arguments that will get sent as data

Returns:

HttpResource

request_with_auth()

Get the request that this resource will make with authentication headers and parameters added. Override auth_headers and/or auth_parameters to provide the headers and/or parameters.

Returns:

(dict) a copy of the request dictionary with authentication added

request_without_auth()

Get the request that this resource will make with authentication headers and parameters from auth_headers and auth_parameters removed.

Returns:

(dict) a copy of the request dictionary with authentication removed
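Adding and stripping authentication on a copy of a request can be sketched as follows. The layout of the request dictionary (a "headers" and a "params" key) is an assumption made for this example and may differ from what the library stores.

```python
import copy


def with_auth_sketch(request, auth_headers, auth_parameters):
    """Return a copy of the request with authentication added."""
    result = copy.deepcopy(request)
    result["headers"].update(auth_headers)
    result["params"].update(auth_parameters)
    return result


def without_auth_sketch(request, auth_headers, auth_parameters):
    """Return a copy of the request with authentication removed."""
    result = copy.deepcopy(request)
    for name in auth_headers:
        result["headers"].pop(name, None)
    for name in auth_parameters:
        result["params"].pop(name, None)
    return result


# Hypothetical request layout and credentials.
request = {"headers": {"Accept": "application/json"}, "params": {"q": "test"}}
auth_headers = {"Authorization": "Bearer secret"}
auth_parameters = {"api_key": "secret"}
signed = with_auth_sketch(request, auth_headers, auth_parameters)
stripped = without_auth_sketch(signed, auth_headers, auth_parameters)
```

Working on deep copies keeps the stored request free of credentials while the signed copy is used for the actual call.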

send(method, *args, **kwargs)

This method handles the gathering of data and updating the model based on the resource configuration. If the data has been retrieved before it will load the data from cache instead. Specify cache_only in your config if you want to prevent any HTTP requests. The data might be missing in that case.

You must specify the method that the resource will be using to get the data. Currently this can be the “get” and “post” HTTP verbs.

Any arguments will be passed to URI_TEMPLATE to format it. Any keyword arguments will be passed as a data dict to the request. If a keyword is listed in the FILE_DATA_KEYS attribute on an HttpResource, then the value of that argument is expected to be a file path relative to DATAGROWTH_WEB_MEDIA_ROOT. The value of that keyword will be replaced with the file before making the request.

Parameters:
  • method – “get” or “post” depending on which request you want your resource to execute

  • args – arguments that will get merged into the URI_TEMPLATE

  • kwargs – keyword arguments that will get sent as data

Returns:

HttpResource

set_error(status, connection_error=False)

Sets the given status on the HttpResource. When dealing with connection_errors it sets valid defaults.

Parameters:
  • status – (int) the error status from the response

  • connection_error – (bool) whether the error occurred during a connection error

Returns:

property success

Returns True if status is within HTTP success range

Returns:

Boolean

static uri_from_url(url)

Given a URL this method will strip the protocol and sort the parameters. That way a database lookup for a URL will always return URLs that logically match that URL.

Parameters:

url – the URL to normalize to URI

Returns:

a normalized URI suitable for lookups
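A standalone approximation of this normalization, built on urllib.parse. The exact output format of the library's method is not specified here, so treat the result as illustrative.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def uri_from_url_sketch(url):
    """Drop the protocol and sort query parameters to get a lookup key."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    uri = parts.netloc + parts.path
    query = urlencode(pairs)
    return uri + "?" + query if query else uri


normalized = uri_from_url_sketch("https://example.com/search?q=test&a=1")
```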

validate_request(request, validate_input=True)

Validates a dictionary that represents a request that the resource will make. Currently it checks the method, which should be “get” or “post”, and whether the current data (if any) is still valid or has expired. Apart from that it validates input, which should adhere to the JSON schema defined in the GET_SCHEMA or POST_SCHEMA attributes.

Parameters:
  • request – (dict) the request dictionary

  • validate_input – (bool) whether to validate input

Returns:

variables(*args)

Parses the input variables and returns a dictionary with a “url” key. This key contains a list of variables that will be used to format the URI_TEMPLATE.

Returns:

(dict) a dictionary where the input variables are available under names

class datagrowth.resources.http.generic.URLResource(*args, **kwargs)

Sometimes you don’t want to build a URI through the URI_TEMPLATE, because you already have a URL from which data should be retrieved directly. For this use case the URLResource is very suitable. Just pass the URL as the first argument to either get or post and the request will be made.

Only full URLs with a protocol are accepted as an argument. Note that it is not possible to adjust the parameters through the parameters method, because it is assumed that all parameters are part of the URL given to get or post.

PARAMETERS = None

class datagrowth.resources.http.files.HttpFileResource(*args, **kwargs)

Sometimes you want to download a file instead of storing the content in the database. For this use case the HttpFileResource is very suitable. Just pass the URL as a first argument to get and the URL will be downloaded as a file, storing it in your MEDIA_ROOT.

The file path of the downloaded file will get stored in the body field. This path will be relative to the MEDIA_ROOT. The path will include a downloads folder and a subfolder that is the app_name of the concrete class. Under that directory there are many possible subdirectories in the form of “x/yz/”, where x, y and z are hexadecimal characters. Creating these subdirectories is necessary to prevent huge download directories that would hamper performance.
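How the hexadecimal characters are chosen is not described here. One common approach, assumed in the sketch below, is to take them from a hash of the URL so that downloads spread evenly over the subdirectories. The function name and the use of SHA-1 are illustrative only.

```python
import hashlib


def download_directory_sketch(app_name, url):
    """Build a relative directory such as downloads/<app_name>/x/yz."""
    # Assumption: x, y and z are the first three hex characters of a URL hash.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "/".join(["downloads", app_name, digest[0], digest[1:3]])


directory = download_directory_sketch("my_app", "https://example.com/report.pdf")
```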

Only full URLs with a protocol will get downloaded. Any URLs without a protocol will get stored as a failure with a 404 (Not Found) error code. Please note that with this class it is not possible to adjust the parameters through the parameters method, because it is assumed that all parameters are part of the URL given to get.

property content

Opens the file stored at the file path in body and returns that file together with the content type.

Returns:

content_type, file

static get_file_name(original, now)

Override this method to change the file naming convention. By default it will take the filename from the URL and prefix it with a datetime string of the date and time at downloading.

Parameters:
  • original – (str) the URL file name

  • now – (datetime) a datetime object to use as prefix input

Returns:

post(*args, **kwargs)

This method calls send with “post” as a method. See the send method for more information.

Parameters:
  • args – arguments that will get merged into the URI_TEMPLATE

  • kwargs – keyword arguments that will get sent as data

Returns:

HttpResource

transform(file)

By default the content property will return the file wrapped in a Django File class. It may be convenient to wrap it in some other way. Override this method and return the file in a different format to change the content return value.

Parameters:

file – (File) the file read from storage

Returns:

(any) file in correct format

class datagrowth.resources.http.files.HttpImageResource(*args, **kwargs)

This class acts like the HttpFileResource with the only difference that it will return content as Pillow images.

transform(file)

By default the content property will return the file wrapped in a Django File class. It may be convenient to wrap it in some other way. Override this method and return the file in a different format to change the content return value.

Parameters:

file – (File) the file read from storage

Returns:

(any) file in correct format

datagrowth.resources.http.files.file_resource_delete_handler(sender, instance, **kwargs)

A Django signal handler that can be bound to a post_delete signal to free disk space when file resources get deleted.

Parameters:
  • sender – receives the class that is sending the signal

  • instance – the object under deletion

  • kwargs – ignored, for compatibility only

Shell

class datagrowth.resources.shell.ShellResource(*args, **kwargs)

You can extend from this base class to declare a Resource that gathers data from any shell command.

This class is a wrapper around the subprocess module and provides:

  • cached responses when retrieving data a second time

The resource stores the stdin, stdout and stderr from commands in the database as well as an abstraction of the command.

clean()

Hook for doing any extra model-wide validation after clean() has been called on every field by self.clean_fields. Any ValidationError raised by this method will not be associated with a particular field; it will have a special-case association with the field defined by NON_FIELD_ERRORS.

clean_stderr(stderr)

This method decodes the stderr from the subprocess result to UTF-8. Override this method to do any further cleanup.

Parameters:

stderr – (bytes) stderr from the command

Returns:

(str) cleaned decoded output

clean_stdout(stdout)

This method decodes the stdout from the subprocess result to UTF-8. Override this method to do any further cleanup.

Parameters:

stdout – (bytes) stdout from the command

Returns:

(str) cleaned decoded output

property content

After a successful run call this method passes stdout from the command through the transform method. It then returns the value of the CONTENT_TYPE attribute as content type and whatever transform returns as data.

Returns:

content_type, data

debug()

A method that prints to stdout the command that will get executed by the ShellResource. This is mostly useful for debugging during development.

environment(*args, **kwargs)

You can specify environment variables for the command based on the input to run by overriding this method. The input from run is passed down to this method, based on this a dictionary should get returned containing the environment variables or None if no environment should be set.

By default this method returns the VARIABLES attribute without making changes to it.

Parameters:
  • args – arguments from the run command

  • kwargs – keyword arguments from the run command

Returns:

a dictionary with environment variables or None

handle_errors()

Raises exceptions upon error statuses. Override this method to raise exceptions for your own error states. By default it raises the DGShellError for any status other than 0.

run(*args, **kwargs)

This method handles the gathering of data and updating the model based on the resource configuration. If the data has been retrieved before it will load the data from cache instead. Specify cache_only in your config if you want to prevent any execution of commands. The data might be missing in that case.

Any arguments will be passed to CMD_TEMPLATE to format it. Any keyword arguments will be parsed into command flags by using the FLAGS attribute. The parsed flags will be inserted into CMD_TEMPLATE wherever the CMD_FLAGS value is present.

Parameters:
  • args – get passed on to the command

  • kwargs – get parsed into flags before being passed on to the command

Returns:

self
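The way run combines positional arguments, flags and the command template can be sketched in isolation. The shapes of CMD_TEMPLATE (a list with a "CMD_FLAGS" placeholder) and FLAGS (a mapping from keyword to flag prefix) are assumptions made for this example, as is the build_command_sketch helper.

```python
CMD_TEMPLATE = ["grep", "CMD_FLAGS", "{}", "{}"]  # assumed template shape
FLAGS = {"context": "--context="}  # assumed mapping of keyword to flag prefix


def build_command_sketch(*args, **kwargs):
    """Format the template with args and insert flags parsed from kwargs."""
    flags = [FLAGS[key] + str(value) for key, value in kwargs.items() if key in FLAGS]
    positional = iter(args)
    command = []
    for part in CMD_TEMPLATE:
        if part == "CMD_FLAGS":
            command.extend(flags)  # flags land where the placeholder sits
        elif "{}" in part:
            command.append(part.format(next(positional)))
        else:
            command.append(part)
    return command


command = build_command_sketch("needle", "haystack.txt", context=2)
```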

property success

Returns True if exit code is 0 and there is some stdout

transform(stdout)

Override this method for particular commands. It takes the stdout from the command and transforms it into useful output for other components. One use case could be to clean out log lines from the output.

Parameters:

stdout – the stdout from the command

Returns:

transformed stdout
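For the log-line use case mentioned above, an override could filter lines before returning the output. The "[log]" prefix is a hypothetical convention invented for this sketch.

```python
def transform_sketch(stdout):
    """Remove log lines from command output and keep the rest."""
    lines = [line for line in stdout.splitlines() if not line.startswith("[log]")]
    return "\n".join(lines)


cleaned = transform_sketch("[log] starting\nresult: 42\n[log] done")
```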

static uri_from_cmd(cmd)

Given a command list this method will sort that list, but keep the first element in first position. That way a database lookup for a command will always return commands that logically match that command, regardless of flag or argument order. At the same time similar commands will appear beneath each other in an overview.

Parameters:

cmd – the command list as passed to subprocess.run to normalize to URI

Returns:

a normalized URI suitable for lookups
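The sorting step can be sketched in a few lines. Joining the sorted list with spaces is an assumption about the output format, made only so the example produces a single string.

```python
def uri_from_cmd_sketch(cmd):
    """Sort everything after the executable so argument order is irrelevant."""
    head, *tail = cmd
    return " ".join([head] + sorted(tail))


first_uri = uri_from_cmd_sketch(["grep", "-r", "needle", "-i"])
second_uri = uri_from_cmd_sketch(["grep", "-i", "needle", "-r"])
```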

validate_command(command, validate_input=True)

Validates a dictionary that represents a command that the resource will run.

It currently checks whether the current data (if any) is still valid or has expired. Apart from that it validates input which should adhere to the JSON schema defined in the SCHEMA attribute.

Parameters:
  • command – (dict) the command dictionary

  • validate_input – (bool) whether to validate input

Returns:

variables(*args)

Parses the input variables and returns a dictionary with an “input” key. This key contains a list of variables that will be used to format the CMD_TEMPLATE.

Returns:

(dict) a dictionary where the input variables are available under names