Processors

Processor (base class)

class datagrowth.processors.base.Processor(config: ConfigurationType | dict[str, Any])

This class is the base class for all processors. All processors have a config attribute that contains the configuration for the Processor.

Beyond that, the base class mainly provides the create_processor and get_processor_class methods. Any class inheriting from Processor can be loaded by name through these methods. This is useful when you want to pass a Processor around without passing the actual callables, because most transport formats (like JSON) don’t support callables.

static create_processor(processor_name: str, config: ConfigurationType | dict[str, Any]) → ProcessorProtocol

This method will load the Processor class given by name and instantiate it with the given configuration.

static get_processor_class(processor_name: str) → type[ProcessorProtocol] | None

This method will load the Processor class given by name and return it. If the Processor does not exist in an installed app, it will return None instead.
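The name-based loading described above can be illustrated with a small, self-contained sketch. This is not datagrowth's implementation: SketchProcessor, its registry, UppercaseProcessor and the LookupError raised for unknown names are all hypothetical and exist only to show why passing a name plus a plain configuration dictionary is enough to rebuild a processor.

```python
from typing import Any, Optional


class SketchProcessor:
    """Hypothetical stand-in for a processor base class (not datagrowth code)."""

    # Maps class names to classes; filled automatically as subclasses are defined.
    _registry: dict = {}

    def __init__(self, config: dict) -> None:
        self.config = config

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        SketchProcessor._registry[cls.__name__] = cls

    @staticmethod
    def get_processor_class(processor_name: str) -> Optional[type]:
        # Unknown names yield None instead of raising.
        return SketchProcessor._registry.get(processor_name)

    @staticmethod
    def create_processor(processor_name: str, config: dict) -> "SketchProcessor":
        processor_class = SketchProcessor.get_processor_class(processor_name)
        if processor_class is None:
            # Assumption of this sketch; the real behavior may differ.
            raise LookupError(f"Unknown processor: {processor_name}")
        return processor_class(config)


class UppercaseProcessor(SketchProcessor):
    def run(self, text: str) -> str:
        return text.upper() + self.config.get("suffix", "")


# Only a name and a plain dict need to travel, for example as JSON.
task = {"processor": "UppercaseProcessor", "config": {"suffix": "!"}}
processor = SketchProcessor.create_processor(task["processor"], task["config"])
result = processor.run("hello")
```

The receiving side never needs the callable itself, only the ability to resolve the name to a class it already has installed.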

ExtractProcessor

class datagrowth.processors.input.extraction.ExtractProcessor(config)

The ExtractProcessor takes an objective through its configuration. Using this objective, it extracts a list of objects from the input data, possibly transforming the data along the way.

Objectives are dictionaries that require at least an “@” key and one other item. Values in this dictionary can be one of the following:

  • A JSON path as described by the reach function (for JSON extraction)

  • A string containing BeautifulSoup expressions using the “soup” and “el” variables (for HTML/XML extraction, not recommended)

  • A processor name and method name (like: Processor.method) that take a soup and el argument (for HTML/XML extraction, recommended)

These values are called or parsed to extract data from the input. Each extracted value is stored in the output under the key of the objective item that produced it.

The special “@” key indicates where extraction should start and its value should result in a list or generator. By default, objective values get evaluated against elements in the list retrieved from the “@” value. Objective items whose keys start with “#” will get evaluated against the entire input.

The output of the ExtractProcessor will typically consist of a list of objects. Each object shares the same keys as the objective except the “@” key. Any keys in the objective that start with “#” will have the same value for all extracted objects, but the “#” will get stripped from the object keys.
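To make the “@” and “#” rules concrete, here is a minimal sketch that applies them to already-parsed JSON data. It does not use datagrowth: resolve_path and extract are hypothetical helpers, and the “$.a.b” path syntax is an assumption made for this illustration (the actual path syntax is the one described by the reach function).

```python
from typing import Any


def resolve_path(path: str, data: Any) -> Any:
    """Hypothetical helper: follow a "$.a.b" style path through dicts and lists."""
    current = data
    for part in path.lstrip("$").strip(".").split("."):
        if part == "":
            continue  # a bare "$" refers to the data itself
        current = current[int(part)] if isinstance(current, list) else current[part]
    return current


def extract(objective: dict, data: Any) -> list:
    """Apply the "@" and "#" rules to parsed JSON data."""
    # Keys starting with "#" are resolved once against the entire input
    # and the "#" is stripped from the resulting key.
    shared = {
        key[1:]: resolve_path(path, data)
        for key, path in objective.items()
        if key.startswith("#")
    }
    results = []
    # The "@" value selects the elements that extraction iterates over.
    for element in resolve_path(objective["@"], data):
        extracted = {
            key: resolve_path(path, element)
            for key, path in objective.items()
            if key != "@" and not key.startswith("#")
        }
        extracted.update(shared)
        results.append(extracted)
    return results


data = {
    "meta": {"source": "example-api"},
    "results": [
        {"title": "First", "author": {"name": "Ada"}},
        {"title": "Second", "author": {"name": "Grace"}},
    ],
}
objective = {
    "@": "$.results",
    "#source": "$.meta.source",
    "title": "$.title",
    "author": "$.author.name",
}
objects = extract(objective, data)
```

Each resulting object has the keys title, author and source. The source value is identical across objects because its objective key started with “#”.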

extract_from_resource(resource)

This is the most common way to extract data with this class. It takes a Resource (which is a source of data) and tries to extract from it immediately.

Parameters:

resource – (Resource) any resource

Returns:

(list) extracted objects from the Resource data

load_objective(objective)

Normally an objective is passed to the ExtractProcessor through its configuration. Use this method to load an objective after the ExtractProcessor has been initialized.

Parameters:

objective – (dict) the objective to use for extraction

Returns:

None

pass_resource_through(resource)

Sometimes you want to retrieve data as-is without filtering and/or transforming. This method is a convenience method to do just that for any Resource. Its interface is similar to extract_from_resource in that you can just pass it a Resource and it will return the data from that Resource.

Parameters:

resource – (Resource) any resource

Returns:

(mixed) the data returned by the resource

transform(content_type, data)

Call this method to transform the input data based on the objective.

If your content_type is not supported by the transformer, you can inherit from this class and write your own method. For example, a content type of application/pdf results in an attempt to call an application_pdf method on this class, passing it the data as an argument. The objective will be available as self.config.objective on the instance.
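The naming convention for content type methods can be sketched as follows. This is a hypothetical stand-in and not the library's code: SketchTransformer and PdfTransformer are invented names, the objective is stored directly on the instance instead of under self.config.objective, and both the stripping of content type parameters and the TypeError for unsupported types are assumptions of this sketch.

```python
from typing import Any


class SketchTransformer:
    """Hypothetical stand-in that shows content type based method dispatch."""

    def __init__(self, objective: dict) -> None:
        self.objective = objective

    def transform(self, content_type: str, data: Any) -> list:
        # "application/pdf" becomes "application_pdf". Parameters such as
        # "; charset=utf-8" are dropped first (an assumption of this sketch).
        mime_type = content_type.split(";")[0].strip()
        method_name = mime_type.replace("/", "_")
        method = getattr(self, method_name, None)
        if method is None:
            # Assumption of this sketch; the real behavior may differ.
            raise TypeError(f"Unsupported content type: {content_type}")
        return method(data)


class PdfTransformer(SketchTransformer):
    def application_pdf(self, data: bytes) -> list:
        # A real handler would parse the PDF; this one only reports its size.
        return [{"size": len(data), "label": self.objective["label"]}]


transformer = PdfTransformer({"label": "report"})
objects = transformer.transform("application/pdf", b"%PDF-1.7")
```

Supporting another content type then only requires adding a method with the matching name to a subclass.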

Parameters:

  • content_type – (content type) The content type of the input data

  • data – (varies) The input data to transform

Returns:

(list) transformed objects

transform_resource(resource)

This is the most common way to transform data with this class. It takes a Resource (which is a source of data) and tries to transform it immediately.

Parameters:

resource – (Resource) any resource

Returns:

(list) extracted objects from the Resource data

TransformProcessor

class datagrowth.processors.input.transform.TransformProcessor(config)

This processor functions like the ExtractProcessor, but has a name that reflects its purpose a bit better. In the future the ExtractProcessor will be deprecated in favor of this TransformProcessor.