Processors

A processor is a plain Python class that inherits from the Processor class and processes data in some way, for instance to extract, transform and load data (ETL).

Processors have two important properties:

  • They take a configuration during initialisation, which can be used during processing

  • If a processor is declared in the processors module of a Django app, it can be loaded by name

These properties make it trivial to dispatch a task with the name of a processor (as a string) and some configuration (as JSON) to, for instance, Celery and do parallel processing and/or processing in a pipeline fashion.
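The dispatch pattern can be sketched in plain Python. Everything below (the registry, the processor and the task function) is a hypothetical stand-in used to show the idea, not Datagrowth's or Celery's actual API:

```python
import json

# Minimal sketch of the dispatch-by-name pattern.
# The registry, processor and task below are illustrative stand-ins.
REGISTRY = {}

def register(cls):
    REGISTRY[cls.__name__] = cls
    return cls

@register
class UpperCaseProcessor:

    def __init__(self, config):
        self.config = config

    def process(self, data):
        return [item.upper() for item in data]

def run_task(processor_name, config_json, data):
    # A task queue worker could receive exactly these serializable
    # arguments, look the class up by name and instantiate it.
    processor_class = REGISTRY[processor_name]
    processor = processor_class(config=json.loads(config_json))
    return processor.process(data)

print(run_task("UpperCaseProcessor", "{}", ["a", "b"]))  # -> ['A', 'B']
```

Because the name and the configuration are just a string and JSON, they can travel through any message broker to a worker process.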

You can create your own processors or use pre-made processors from a Django app. To start using Datagrowth processors, include datagrowth as an app in your INSTALLED_APPS setting

INSTALLED_APPS = (
    ...
    'datagrowth',
)

The first type of processors that Datagrowth ships with is processors that handle input.

Input

Very often when gathering data you’re only interested in part of the data and can discard the rest. The TransformProcessor helps you transform data from common formats like JSON, HTML and XML.

Let’s imagine a scenario where you want to get the name and description of objects in a JSON response that are stored under the results key. Together with this you want to store the source of these objects, which is stored under the source key inside a metadata object. It can be useful to store metadata such as a source together with the actual data for later processing.

To handle the scenario described above with the TransformProcessor, you would write a configuration as follows

from datagrowth.config import create_config
from datagrowth.processors import TransformProcessor


config = create_config("transform_processor", {

    # Objectives indicate what data you want to retrieve from a source
    "objective": {

        "@": "$.results",  # '@' value specifies where to start extraction
        "#source": "$.metadata.source",  # this value is the same for all results

        # Objective items not starting with '@' or '#'
        # are looked up in the context of the '@' item
        "name": "$.name",
        "description": "$.description"
    }

})

transformer = TransformProcessor(config=config)
results = transformer.transform("application/json", """{
    "metadata": {"source": "data tooling", ... more keys you don't need},
    "results": [
        {"name": "datagrowth", "description": "data mash up", ... more keys you don't need},
        {"name": "scrappy", "description": "website scraping", ... more keys you don't need}
    ]
}""")

print(results)
# out: [{"name": "datagrowth", "description": "data mash up", "source": "data tooling"}, ...]

In this case the objective values are JSON paths. These paths point to the data that should get extracted; everything else is ignored. Read more about how they work in the reach function documentation.
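To give an idea of what such a path does, here is a simplified, illustrative lookup for the dotted "$." notation. Datagrowth's actual implementation is its reach function, which supports more than this sketch:

```python
# Simplified illustration of resolving a "$."-style JSON path.
# Datagrowth's real reach function is more capable than this sketch.
def reach(path, data):
    value = data
    for key in path.lstrip("$.").split("."):
        if key:  # "$" on its own resolves to the whole document
            value = value[key]
    return value

data = {"metadata": {"source": "data tooling"}, "results": []}
print(reach("$.metadata.source", data))  # -> data tooling
print(reach("$.results", data))  # -> []
```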

Instead of JSON paths you can use BeautifulSoup expressions to extract from HTML and XML. Let’s imagine a scenario where we want to get data from an unordered HTML list. We want to store the title attribute of each item as name and its content as description. Lastly the source will come from the title element of the page.

For that the TransformProcessor configuration looks like this

from bs4 import BeautifulSoup

from datagrowth.config import create_config
from datagrowth.processors import TransformProcessor


config = create_config("transform_processor", {

    # Objectives indicate what data you want to retrieve from a source
    "objective": {

        # The objective items below can use the "soup" variable
        # which is the passed BeautifulSoup instance
        "@": "soup.find_all('li')",  # '@' value specifies where to start extraction
        "#source": "soup.find('title').text",  # this value is the same for all results

        # Objective items not starting with '@' or '#'
        # can use the "el" variable which in this case is a BeautifulSoup <li> tag
        # as well as the "soup" variable which is the BeautifulSoup instance
        "name": "el.attrs['title']",  # notice the transformation from 'title' to 'name'
        "description": "el.text"
    }

})

transformer = TransformProcessor(config=config)
soup = BeautifulSoup("""
    <html>
        <head><title>data tooling</title></head>
        <body>
            <ul>
                <li title="datagrowth">data mash up</li>
                <li title="scrappy">website scraping</li>
            </ul>
        </body>
    </html>
""")
results = transformer.transform("text/html", soup)

print(results)
# out: [{"name": "datagrowth", "description": "data mash up", "source": "data tooling"}, ...]

In the case above we do a little more than extraction: we also transform a title value into a name value. That way the output of this transformer is interchangeable with the output from the JSON scenario. This can be very useful when dealing with multiple different data sources.

Custom made

Sometimes you’ll want to create a Processor that others can use, or using a Processor with Datagrowth can simply be cleaner than other means.

We’ll illustrate how to create your own Processor and use it with a TransformProcessor. You’ll see that when extracting from HTML or XML this is a much cleaner method than using BeautifulSoup expression strings.

First we’ll show you how an extraction configuration could look when using a custom processor

from datagrowth.config import create_config


config = create_config("transform_processor", {

    # Objectives indicate what data you want to retrieve from a source
    "objective": {

        # The DataToolingExtractor is a class we'll define later
        # It will have methods like get_entries and get_source
        "@": "DataToolingExtractor.get_entries",
        "#source": "DataToolingExtractor.get_source",
        "name": "DataToolingExtractor.get_name",
        "description": "DataToolingExtractor.get_description"
    }

})

The big advantage of using a processor over embedding functions directly in your objective is that the objective remains serializable, to for instance JSON, so you can transfer it over the wire.
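Because the objective above contains nothing but strings (dotted class-and-method names), it survives a JSON round trip, which a direct function reference would not. A quick check:

```python
import json

# The objective from the example above: nothing but serializable strings.
objective = {
    "@": "DataToolingExtractor.get_entries",
    "#source": "DataToolingExtractor.get_source",
    "name": "DataToolingExtractor.get_name",
    "description": "DataToolingExtractor.get_description",
}

# Serialize as you would when sending the configuration to a worker ...
payload = json.dumps({"objective": objective})
# ... and restore it on the other side without losing information.
restored = json.loads(payload)["objective"]
print(restored == objective)  # -> True
```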

To create the DataToolingExtractor that we can use with the TransformProcessor, you’ll need to create a processors module and place it inside an installed Django app.

Inside that module you can create processors by inheriting from the Processor class

from datagrowth.processors import Processor


class DataToolingExtractor(Processor):

    @classmethod
    def get_entries(cls, soup):
        return soup.find_all('li')

    @classmethod
    def get_source(cls, soup):
        title = soup.find('title')
        return title.text if title else None

    @classmethod
    def get_name(cls, soup, el):
        return el.attrs.get("title", None)

    @classmethod
    def get_description(cls, soup, el):
        return el.text

You can also use processors from your own code. The class methods create_processor and get_processor_class on Processor can help you do that. Read more about them in the Processor reference.
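To show what loading a processor class by name boils down to, here is a hypothetical sketch using a stand-in base class; the real Processor.get_processor_class additionally searches the processors modules of installed Django apps:

```python
# Stand-in base class for illustration; not datagrowth.processors.Processor.
class Processor:

    @classmethod
    def get_processor_class(cls, name):
        # Walk direct subclasses and match on the class name
        for subclass in cls.__subclasses__():
            if subclass.__name__ == name:
                return subclass
        raise LookupError(f"Unknown processor: {name}")

class DataToolingExtractor(Processor):
    pass

print(Processor.get_processor_class("DataToolingExtractor").__name__)
# -> DataToolingExtractor
```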