Python on the Nuix Engine Part 1: Integrating Python into Your Nuix Workflow

Nuix Workstation and the Nuix Engine provide multiple ways to run your own code, allowing you to customize your workflow, apply custom tasks or do things that aren’t built into the platform (yet). In this series of blog posts, I will explain how to call Python applications from inside the Nuix Engine. You can access all the code discussed in this blog post from our GitHub.

Getting Familiar with Scripting

For a general introduction to scripting with Nuix, read our blog posts on Getting Started with Scripting on the Nuix Engine, Part 1 and Part 2. This blog post links to a recorded webinar about Worker Side Scripting. And this one uses Python as the scripting language to integrate with Kafka for a specific example workflow. Finally, this post talks about using scripts to interact with other APIs and uses Python as the language of choice. That last one also has a link to a GitHub Repository with sample code, so you can get a head start with some working code to reference.

Jython

Nuix allows you to run Python code inside the Nuix Engine as one of its scripting languages – employing a flavor of Python called Jython: Python on the Java Virtual Machine.

It’s important to understand that Jython is not the traditional CPython – it doesn’t have access to all the libraries traditionally available to Python. Also, Jython is only compatible with Python 2.7, so it lacks much of the syntax and many of the core features introduced since the big 3.0 release.

These issues are not without solutions: If your preferred Python library is not supported in Jython there is a good chance you can substitute a Java library for it, and Nuix may already ship the library you need. (You will see an example of that in our code below.)

Using Java APIs instead of Python ones is workable but gets harder as your environment grows more complicated. The more libraries you use that are incompatible with Jython or that require modern Python features, the more you stand to gain from exploring alternative means of using Python with the Nuix Engine. These alternatives include:

  1. Running Python standalone as a command line triggered from inside the Nuix Engine
  2. Running Python in an external microservice and connecting to that service from inside the Nuix Engine
  3. Inverting the paradigm: calling the Nuix Engine from inside your Python application
  4. Calling into the Nuix Engine from a Python application using the RESTful API.

Each option has its own use cases. They all use external Python environments to give you access to the Python libraries you need. The first two options rely on scripts that run inside the Nuix Engine (in-Engine scripts) communicating with the external Python environment, while the other two use the external Python environment as the driver to call into the Nuix Engine. This post will focus on the first two options: controlling an external Python application from inside Nuix, either by running it on a command line or by using a microservice. Part 2 of this series will discuss the remaining two options.

Command Line

A traditional method of running a Python application is from the command line:

> python my_app.py --say Hello --to "Inspector Gadget"

You can use the same approach in Nuix, taking advantage of the subprocess module to execute the command line. Conceptually, you would design the application around a script that runs inside the Nuix Engine or Nuix Workstation (an in-Engine script). That script collects what is needed from the case, formats it so the external application can use it (perhaps writing it to disk), triggers the external Python application, collects its results, and updates the case data based on those results.

You can use the following code to call a command line application from the Nuix Workstation scripting console:

import os
from subprocess import Popen, PIPE

python_project_path = r'C:\Projects\Python\Python-On-The-Engine'
predict_script = r'cli\predict_from_folder.py'
path_to_images = r'C:\Projects\RestData\Exports\temp'

# Build the command line and launch the external application,
# piping stdout so we can monitor its progress from the in-Engine script.
python_script = os.path.join(python_project_path, predict_script)
cmd_args = ['python.exe', python_script, path_to_images]
prediction_process = Popen(cmd_args, stdout=PIPE,
                           universal_newlines=True, shell=True)

For these examples, please refer to the GitHub repository for the full code. It has several Python packages, of which we will use two here. The first is the img_classifier package, which holds a basic machine learning image classification application; its environment requires CPython, so it cannot run in the Nuix Engine (the GitHub repository’s ReadMe goes into more detail). In this example, we choose to run it from a command line.

The second package used in this example is the cli package in the repository. It has two files: cli.predict_selected.py is the in-Engine Python code we will use for demonstration here, and cli.predict_from_folder.py is a command line entry point for the package. As an external application, we will discuss what it does but not show its code.

The in-Engine code shown above uses the standard Python library’s subprocess package to open a subprocess to run the command line application. Specifically, it uses the Popen class to create a new process. We can follow the state of the process by reading its stdout, which we have access to because we piped it (via PIPE) into our Python application.

# Poll the process until it exits, echoing its output as it runs.
return_code = None
while return_code is None:
    return_code = prediction_process.poll()
    if return_code is None:
        # Still running: read the next line of progress output.
        output = prediction_process.stdout.readline()
        print(output)
    else:
        # Finished: report the exit code and drain any remaining output.
        print('Return Code: ' + str(return_code))
        output = prediction_process.stdout.readlines()
        print(output)

NOTE: It’s best practice to read both output streams (stdout and stderr) from the external process and to use a more structured format to read the results. If the external application writes to an output stream and nothing reads it, the stream’s buffer can fill up, causing the application to hang.
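
One way to avoid that (a sketch, not the repository’s code, reusing the cmd_args from above) is to pipe stderr alongside stdout and let communicate() drain both streams once you no longer need line-by-line progress:

from subprocess import Popen, PIPE

# Pipe stderr as well so neither stream can fill up and block the child.
prediction_process = Popen(cmd_args, stdout=PIPE, stderr=PIPE,
                           universal_newlines=True, shell=True)

# communicate() reads both streams to completion and waits for exit,
# avoiding the deadlock an unread pipe can cause.
stdout_text, stderr_text = prediction_process.communicate()
print('Return Code: ' + str(prediction_process.returncode))
if stderr_text:
    print('Errors reported: ' + stderr_text)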

Reading the output like this gets the results of any work the external application creates, but it isn’t very structured. Another method would be to have the external application write results out to a file (or files), then monitor those files in the in-Engine script for progress and results. The example below uses a JSON file to transport results. The external application will write the JSON file in the format:

{
  "status": {
    "done": <True|False>,
    "progress": <%complete>,
    "current_item": <index of current item>,
    "total": <count of total items>,
    "errors": []
  },
  "results": {
    "<item GUID>": [{"<classification1>": <score1>}, ...]
    ...
  }
}
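
For illustration, the writer’s side of this contract could look something like the following. This is a minimal sketch, not the repository’s actual code; the temp-file-and-rename step is an assumption I’ve added so each update lands atomically, making partial reads on the in-Engine side rare:

import json
import os

def write_status(results_file, current_item, total, results=None):
    # Describe where we are; results is only populated on the final write.
    state = {
        'status': {
            'done': results is not None,
            'progress': round(100.0 * current_item / total, 1),
            'current_item': current_item,
            'total': total,
            'errors': []
        },
        'results': results or {}
    }
    # Write to a temp file and rename so readers never see half-written JSON.
    temp_file = results_file + '.tmp'
    with open(temp_file, 'w') as out:
        json.dump(state, out)
    os.replace(temp_file, results_file)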

While the application runs, it updates the status in the JSON and when it’s complete, it writes out the results. So our in-Engine code has to read the status to learn progress, then use the results section to get the data it needs to do work on the case items:

import json
import time

results_poll_time = 5  # seconds between polls; pick a value that suits your workload
results_json_filename = 'inference.json'

# path_to_results is the folder where the external application writes its JSON
results_file = os.path.join(path_to_results, results_json_filename)
while not os.path.exists(results_file):
    # File not created yet, keep waiting
    time.sleep(results_poll_time)
done = False

while not done:
    with open(results_file, 'r') as status_file:
        try:
            status = json.load(status_file)

            done = status['status']['done']
            if not done:
                print('Progress: ' + str(status['status']['progress']) + '% [' + 
                      str(status['status']['current_item']) +
                      '/' + str(status['status']['total']) + ']')
        except ValueError as err:
            # Expected when the writer updates the file while we are
            # reading it; skip this poll and try again.
            pass

    time.sleep(results_poll_time)
print('Finished prediction')

with open(results_file, 'r') as results_data_file:
    process_results_data = json.load(results_data_file)
    prediction_results = process_results_data['results']

    for item_guid, image_results in prediction_results.items():
        # Do something with the prediction results for this item
        pass
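
As one illustration of what that loop body might do, you could look each item up in the open case by its GUID and record the top classification. This is a sketch; the search-by-GUID query works in Nuix Workstation’s scripting console, but the metadata field name is my own invention:

for item_guid, image_results in prediction_results.items():
    # Find the case item this result belongs to.
    matches = current_case.search('guid:' + item_guid)
    if not matches:
        continue
    item = matches[0]
    # image_results is a list of {label: score} dicts; take the first.
    top_label, top_score = list(image_results[0].items())[0]
    item.getCustomMetadata()['image_classifier_top'] = \
        top_label + ':' + str(top_score)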

Add a little cleanup code, comments, and testing and we would be just about done. There is a problem, though: this relies on Python being installed and accessible from the command line, and specifically on the correct Python environment for our application being on the executable path, with all the required libraries available at execution time. That may hold if there is a single Python environment on the machine, but it is far less likely if you have multiple Python environments, use Anaconda or Python virtual environments, or want to ship a specific, controlled Python environment to run on.

To handle that, we want to control the operating system’s environment variables to ensure the environment we want is the one that gets executed. To do that, we start the script with the following environment configuration:

python_env_path = r'C:\Projects\Python\Python-On-The-Engine\env'

def initialize_environment():
    # Build a PYTHONPATH covering the project and the environment's library
    # folders, mirroring what an activated Anaconda environment provides.
    python_path = python_project_path + ';' + python_env_path + ';' + \
        python_env_path + r'\Lib;' + python_env_path + r'\DLLs;' + \
        python_env_path + r'\Library\user\bin;' + \
        python_env_path + r'\Library\bin;' + python_env_path + r'\bin'
    # Prepend it to PATH so this environment wins over any system Python.
    path = python_path + ';' + os.getenv('PATH')

    os.environ['PYTHONPATH'] = python_path
    os.environ['PATH'] = path

initialize_environment()

Note: We put the python_path at the start of the PATH environment variable so our desired environment pre-empts any Python which may already be configured on the path.

The above is a full environment configuration for an Anaconda environment installed at C:\Projects\Python\Python-On-The-Engine\env, in which the external Python application runs. You can find the environment.yml file in the repository to recreate the environment on your system.
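
Alternatively, if all you need is the right interpreter, you could skip the PATH manipulation and point the command line at the environment’s own executable. A sketch, assuming a standard layout under the env folder:

# Call the environment's interpreter explicitly instead of relying on PATH.
python_exe = os.path.join(python_env_path, 'python.exe')
cmd_args = [python_exe, python_script, path_to_images]

Be aware that Anaconda environments often still need their Library\bin folders on PATH to locate native DLLs, which is why the full configuration above is the safer option.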

For a complete listing of the in-Engine code used in this example see the cli.predict_selected.py file in the GitHub repository, and the repository’s ReadMe for details on how to run it. Essentially the instructions to run are:

  • Select some JPEG files in an open case in Nuix Workstation
  • Copy the contents of cli.predict_selected.py into the Scripting Console
  • Modify the paths to match your system
  • Press Execute

Conceptually, this is a simple approach, and the external Python application can run with little or no change from how it might naturally run without the Nuix Engine involved. But it gets a bit more complicated on the in-Engine side of things: we need to ensure the correct Python environment runs the application, and we need to know how to monitor the application, parse the results, and align them with the items that need to be worked on.

Microservice

An alternative approach to calling an external application from the Nuix Engine is to connect to a running instance of the Python application, using sockets or shared memory, for example. There are benefits here: you can start the Python application before connecting to it from the Nuix Engine, which takes the responsibility of configuring the environment away from the in-Engine script. You can also keep the application running, avoiding the repeated start-up and shutdown cost of initializing the external Python environment.

In the command line example, we processed all the images at once to minimize the number of times the environment had to start. One of the costs of that approach is having to parse through a results file to get the relevant data for each item. If we could instead:

  • keep the application running
  • pass one item in at a time
  • have the external application work on that one item then return a response,

we would know exactly which item the results belong to – we wouldn’t have to search for it, or sift through a large results stack to get the correct output.

A common way to approach having an independent long-running application that you want to interact with is to use a microservice and to interact with that service using a REST-like API.

This brings us to our second example, which you can find in the microservice package: microservice.predict_selected.py and microservice.predict_service.py files. Like our command line example, we use the img_classifier package to run a simple image classification. This time we use the microservice.predict_service.py module to run it via a Flask application. (You can read the repository’s ReadMe to learn how to run it and read the code to see how it works.) For this article, all you need to know is that it exposes a REST-like interface with a /predict/{item GUID} endpoint we will use to predict on a specific image.
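
To make the shape of that endpoint concrete, here is a minimal sketch of what such a Flask route might look like. It is illustrative only; classify_image and the exact response fields are assumptions, not the repository’s actual code:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict/<item_guid>', methods=['POST'])
def predict(item_guid):
    # The in-Engine script uploads the image as a multipart form file
    # keyed by the item's GUID.
    uploaded = request.files[item_guid]
    scores = classify_image(uploaded.read())  # hypothetical classifier call
    return jsonify({'results': {item_guid: scores}})

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=8982)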

The application design is to use an in-Engine script to:

  • Loop through all the images one at a time
  • Make the POST request to the correct endpoint
  • Send the image via a Multipart Form file upload
  • Read the results back from the response to directly store in the image’s metadata.

In this example we will see how to use the in-Engine script to communicate with the Flask Python application using Apache’s HttpClient to make the REST calls. You shouldn’t need to install anything new: the HttpClient Java library is already shipped in the Nuix environment and the in-Engine script has full access to the Nuix Java environment.

We will start laying down a framework to make the rest of the application easier to use. At the heart of a REST client is making requests to the server. The base level request in our application is handled in this code:

import json
from java.nio.charset import Charset
from org.apache.http.impl.client import HttpClients
from org.apache.http.client.methods import RequestBuilder
from org.apache.http.util import EntityUtils

utf8 = Charset.forName('UTF-8')

def do_request(http_request):
    # A fresh client per request keeps the example simple.
    http_client = HttpClients.createDefault()

    try:
        response = http_client.execute(http_request)

        status_code = response.getStatusLine().getStatusCode()
        response_body = EntityUtils.toString(response.getEntity(), utf8)

        # Parse JSON on success; pass the raw body through on errors.
        if status_code < 300 and response_body is not None:
            return status_code, json.loads(response_body)
        else:
            return status_code, response_body
    finally:
        http_client.close()

The parameter to do_request is an org.apache.http.client.methods.HttpUriRequest object. After executing the request, it reads the response, returning the status code and a dictionary created from the response body’s JSON. We use get(...) and post(...) methods to generate those HttpUriRequest objects and then call this method to execute the request. Since the get(...) and post(...) methods do little more than format the URL and reshape the return values, I won’t walk through the repository’s code here; you can see it on GitHub.
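
That said, here is a sketch of what such helpers could look like, built on the RequestBuilder imported above (illustrative, not the repository’s exact code):

def build_url(host, path_segments):
    # e.g. ('http://127.0.0.1:8982', ['predict', guid]) -> '.../predict/<guid>'
    return host + '/' + '/'.join(path_segments)

def get(host, path_segments):
    request = RequestBuilder.get(build_url(host, path_segments)).build()
    status_code, parsed = do_request(request)
    return status_code < 300, parsed

def post(host, path_segments, body=None):
    builder = RequestBuilder.post(build_url(host, path_segments))
    if body is not None:
        builder.setEntity(body)
    status_code, parsed = do_request(builder.build())
    # Collapse the status code into the simple success flag callers expect.
    return status_code < 300, parsed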

The way the in-Engine script works is to loop through all the selected items and then run the prediction for each one individually. The prediction looks like this:

from org.apache.http.entity.mime import MultipartEntityBuilder, HttpMultipartMode
from org.apache.http.entity import ContentType

HOST = 'http://127.0.0.1:8982'

item_guid = item.getGuid()
item_filename = item.getLocalisedName()

# Read the image's binary content out of the case.
image_data = item.getBinary().getBinaryData().getInputStream().readAllBytes()

# Package the image as a multipart form file upload, keyed by the item's GUID.
request_body = MultipartEntityBuilder.create() \
    .setMode(HttpMultipartMode.BROWSER_COMPATIBLE) \
    .addBinaryBody(item_guid, image_data, ContentType.DEFAULT_BINARY, item_filename) \
    .build()

success, response = post(HOST, ['predict', item_guid], body=request_body)

We get some data we need to make the request – such as the item’s GUID and name. We read the item’s binary data, then build a MultipartEntity – the HttpClient class for a Multipart Form file upload – and use that to make a POST request to the microservice’s /predict endpoint. The response will have the prediction results for the image, so we can go straight to parsing and setting the results to the item’s custom metadata:

# The service returns scores keyed by the item's GUID.
predictions = response['results'][item_guid]

# Format each {label: score} pair as 'label:score%' and join with semicolons.
prediction_data = ';'.join([list(pred.items())[0][0] + ':' +
                            str(round(float(list(pred.items())[0][1]) * 100, 2)) + '%'
                            for pred in predictions])

item_custom_metadata = item.getCustomMetadata()
item_custom_metadata['image_classifier_top3'] = prediction_data
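
After the script runs, each selected image carries a custom metadata field along the lines of image_classifier_top3 = tabby:87.42%;tiger_cat:8.1%;Egyptian_cat:2.33% (the labels and scores here are illustrative).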

That is most of the code you need! There is some code for formatting URLs, looping through items, and limiting the selection to JPEGs, but not much else. There is a lot less configuration needed – just the HOST variable to point at the address and port where the microservice runs. The real complexity is pushed off to launching the microservice. Running a Flask application isn’t necessarily tough – read the repository’s ReadMe for details on how to configure and run it. Before going into any sort of production environment, you should read more about how Flask applications are configured to ensure the service is deployed securely for your environment.

Summary

In this article, we learned how to call Python applications from inside the Nuix Engine.

We briefly talked about using Jython in Nuix Workstation and then provided details on how to use the in-Engine scripting to talk to an external Python application running in its own environment. We discussed two methods: running a Python application from the command line and connecting to a Python microservice.

Using the Python command line application is straightforward and requires minimal reconfiguration of the source application. It is best suited for large batches of work, especially if you already have a functional standalone application you want to integrate into your Nuix workflow.

Using a microservice allows you to connect to a long-running process and is best used when your Python application has a high start-up cost, or when you need to run the application often and get results back quickly. It can simplify the work needed for in-Engine scripting, while requiring a bit more work on the external Python application to get it ready. It also allows you to run the Python application on a different host to balance resource use between Nuix and the application.

This isn’t the complete story. The next post will discuss how to incorporate the Nuix Engine into your own Python application and workflow.

Additional Resources

The GitHub repository with the code used in this blog:

Other blog posts:

Other examples in GitHub:

Documentation hub:

Downloadable documentation: