Processors¶
Processor (base class)¶
- class datagrowth.processors.base.Processor(config: ConfigurationType | dict[str, Any])¶
This class is the base class for all processors. All processors have a config attribute that contains the configuration for the Processor.
Beyond that, the base class mainly provides the create_processor and get_processor_class class methods. Any class inheriting from Processor can be loaded through these methods by its name. This is useful when you want to transfer the Processor without transferring the actual callables, because most transport formats (like JSON) don’t support callables.
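The idea of loading a class by its name can be sketched as follows. This is a minimal illustration and not Datagrowth's implementation: MiniProcessor, UppercaseProcessor, the registry, and the method signatures shown here are all hypothetical, and the real create_processor and get_processor_class may take different arguments.

```python
from __future__ import annotations

from typing import Any


class MiniProcessor:
    """Hypothetical stand-in for a processor base class (not Datagrowth's code)."""

    # Maps class names to classes so they can be looked up from plain strings.
    _registry: dict[str, type[MiniProcessor]] = {}

    def __init__(self, config: dict[str, Any]) -> None:
        self.config = config

    def __init_subclass__(cls, **kwargs: Any) -> None:
        # Every subclass registers itself under its own class name.
        super().__init_subclass__(**kwargs)
        MiniProcessor._registry[cls.__name__] = cls

    @classmethod
    def get_processor_class(cls, name: str) -> type[MiniProcessor]:
        return cls._registry[name]

    @classmethod
    def create_processor(cls, name: str, config: dict[str, Any]) -> MiniProcessor:
        return cls.get_processor_class(name)(config)


class UppercaseProcessor(MiniProcessor):
    def run(self, text: str) -> str:
        return text.upper()


# A JSON-friendly task description: only strings and dicts, no callables.
task = {"processor": "UppercaseProcessor", "config": {"source": "demo"}}
processor = MiniProcessor.create_processor(task["processor"], task["config"])
result = processor.run("hello")
```

Because the task holds only a class name and a configuration dictionary, it can be serialized to JSON and turned back into a working object on the receiving side.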
ExtractProcessor¶
- class datagrowth.processors.input.extraction.ExtractProcessor(config)¶
The ExtractProcessor takes an objective through its configuration. Using this objective it will extract a list of objects from the input data, possibly transforming it.
Objectives are dictionaries that require at least an “@” key and one other item. Values in this dictionary can be one of the following:
A JSON path as described by the reach function (for JSON extraction)
A string containing BeautifulSoup expressions using the “soup” and “el” variables (for HTML/XML extraction, not recommended)
A processor name and method name (like: Processor.method) that take a soup and el argument (for HTML/XML extraction, recommended)
These values will be called or parsed to extract data from the input data. Each extracted value gets stored in the output under the key of the objective item that produced it.
The special “@” key indicates where extraction should start and its value should result in a list or generator. By default objective values get evaluated against the elements in the list retrieved from the “@” value. Objective items whose keys start with “#” will get evaluated against the entire input.
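The objective semantics for JSON input can be illustrated with a small standalone sketch. It does not use Datagrowth itself: reach_path is a simplified, hypothetical stand-in for the reach function, the "$.key.index" path syntax is an assumption made for this example, and extract only mimics the behaviour described in the text.

```python
from typing import Any


def reach_path(path: str, data: Any) -> Any:
    """Simplified, hypothetical lookup: "$.a.0.b" walks dict keys and list indexes."""
    current = data
    for part in path.split(".")[1:]:  # skip the leading "$"
        current = current[int(part)] if isinstance(current, list) else current[part]
    return current


def extract(objective: dict[str, str], data: Any) -> list[dict[str, Any]]:
    """Applies an objective the way the text describes it (illustration only)."""
    # Keys starting with "#" are evaluated once against the entire input
    # and the "#" is stripped from the resulting key.
    shared = {
        key[1:]: reach_path(path, data)
        for key, path in objective.items()
        if key.startswith("#")
    }
    results = []
    # The "@" value selects the list of elements where extraction starts.
    for element in reach_path(objective["@"], data):
        item = dict(shared)
        for key, path in objective.items():
            if key == "@" or key.startswith("#"):
                continue
            # All other keys are evaluated against the current element.
            item[key] = reach_path(path, element)
        results.append(item)
    return results


data = {
    "query": "fruit",
    "records": [
        {"id": 1, "name": {"en": "apple"}},
        {"id": 2, "name": {"en": "pear"}},
    ],
}
objective = {"@": "$.records", "#query": "$.query", "id": "$.id", "name": "$.name.en"}
extracted = extract(objective, data)
```

Here extracted holds one object per record, each with the keys query, id and name; the query value is identical across objects because it comes from a “#” key.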
The output of the ExtractProcessor will typically consist of a list of objects. Each object shares the same keys as the objective except the “@” key. Any keys in the objective that start with “#” will have the same value for all extracted objects, but the “#” will get stripped from the object keys.
- extract_from_resource(resource)¶
This is the most common way to extract data with this class. It takes a Resource (which is a source of data) and tries to extract from it immediately.
- Parameters:
resource – (Resource) any resource
- Returns:
(list) extracted objects from the Resource data
- load_objective(objective)¶
Normally an objective is passed to the ExtractProcessor through its configuration. Use this method to load an objective after the ExtractProcessor has been initialized.
- Parameters:
objective – (dict) the objective to use for extraction
- Returns:
None
- pass_resource_through(resource)¶
Sometimes you want to retrieve data as-is, without filtering or transforming it. This convenience method does just that for any Resource. Its interface is similar to extract_from_resource in that you can just pass it a Resource and it will return the data from that Resource.
- Parameters:
resource – (Resource) any resource
- Returns:
(mixed) the data returned by the resource
- transform(content_type, data)¶
Call this method to start transforming the input data based on the objective.
If your content_type is not supported by the transformer, you can inherit from this class and write your own method. A content type of application/pdf would try to call an application_pdf method on this class, passing it the data as an argument. The objective will be available as self.config.objective on the instance.
- Parameters:
content_type – (content type) The content type of the input data
data – (varies) The input data to transform
- Returns:
(list) transformed objects
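The content type dispatch described for transform can be sketched in a few lines. This is an illustration under assumptions and not the library's code: MiniTransformer and PdfTransformer are hypothetical, the objective is stored as self.objective instead of self.config.objective for brevity, and the error raised for an unsupported content type is a choice made for this example.

```python
class MiniTransformer:
    """Hypothetical illustration of dispatching on content type."""

    def __init__(self, objective: dict) -> None:
        self.objective = objective

    def transform(self, content_type: str, data):
        # "application/pdf" becomes the method name "application_pdf".
        method_name = content_type.replace("/", "_")
        method = getattr(self, method_name, None)
        if method is None:
            # Assumption: the real class may signal this case differently.
            raise TypeError(f"Unsupported content type: {content_type}")
        return method(data)


class PdfTransformer(MiniTransformer):
    def application_pdf(self, data: bytes) -> list[dict]:
        # A real subclass would parse the PDF here; this stub only reports the size.
        return [{"size": len(data)}]


transformed = PdfTransformer({}).transform("application/pdf", b"%PDF-1.7")
```

Supporting a new format then only requires adding a method with the matching name to a subclass. The sketch does not handle content types with parameters or characters such as “+”.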
- transform_resource(resource)¶
This is the most common way to transform data with this class. It takes a Resource (which is a source of data) and tries to transform it immediately.
- Parameters:
resource – (Resource) any resource
- Returns:
(list) extracted objects from the Resource data
TransformProcessor¶
- class datagrowth.processors.input.transform.TransformProcessor(config)¶
This processor functions like the ExtractProcessor, but has a name that describes its function a bit better. In the future the ExtractProcessor will be deprecated in favor of this TransformProcessor.