The Nuix Engine: Integrating Python into Your Nuix Workflow

Written by Steven Luke

 

Nuix Workstation and the Nuix Engine provide multiple ways to run your own code, allowing you to customize your workflow, apply custom tasks or do things that aren’t built into the platform (yet). You can access all the code discussed in this blog post from our GitHub.

 

Jython

Nuix allows you to run Python code inside the Nuix Engine as one of its scripting languages – employing a flavor of Python called Jython: Python on the Java Virtual Machine.

It’s important to understand that Jython is not the traditional CPython – it doesn’t have access to all the libraries traditionally available to Python. Also, Jython is only compatible with Python 2.7 so it doesn’t have access to a lot of the Python syntax and core features available since the big 3.0 release.
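If a script might be pasted into either runtime, a quick check makes the difference visible before anything fails. This is a minimal sketch using only the standard library; the helper name describe_runtime is illustrative and not part of any Nuix API.

```python
import platform
import sys


def describe_runtime():
    """Return (implementation, major, minor) for the running interpreter."""
    # platform.python_implementation() reports 'Jython' on Jython and
    # 'CPython' on a standard python.org or Anaconda build.
    implementation = platform.python_implementation()
    major, minor = sys.version_info[0], sys.version_info[1]
    return implementation, major, minor


implementation, major, minor = describe_runtime()
if implementation == 'Jython' or major < 3:
    print('Python 2.7-style runtime: avoid f-strings and CPython-only packages')
else:
    print('Running on ' + implementation + ' ' + str(major) + '.' + str(minor))
```

The sketch sticks to Python 2.7 syntax so it can run in both places.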

These issues are not without solutions: If your preferred Python library is not supported in Jython there is a good chance you can substitute a Java library for it, and Nuix may already ship the library you need. (You will see an example of that in our code below.)

Using Java APIs instead of Python ones is workable but can get harder to do when your environment gets more complicated. The more libraries you use that are incompatible with Jython or that require modern Python features, the better off you would be exploring alternative means of using Python with the Nuix Engine. These alternatives include:

  1. Running Python standalone as a command line triggered from inside the Nuix Engine
  2. Running Python in an external microservice and connecting to that service from inside the Nuix Engine
  3. Inverting the paradigm: calling the Nuix Engine from inside your Python application
  4. Calling into the Nuix Engine from a Python application using the RESTful API.

Each option has its own use cases. They all use external Python environments to give you access to the Python libraries you need. The first two options rely on scripts that run inside the Nuix Engine (in-Engine scripts) and communicate with the external Python environment, while the other two use the external Python environment as the driver that calls into the Nuix Engine. This post will focus on the first two options: controlling an external Python application from inside Nuix, either by running it from a command line or by using a microservice. Part 2 of this series will discuss the other two options.

 

Command Line

A traditional method of running a Python application is from the command line:

> python my_app.py --say Hello --to "Inspector Gadget"

You can use the same approach in Nuix, taking advantage of the subprocess module to execute the command line. Conceptually, a script that runs inside the Nuix Engine or Nuix Workstation (an in-Engine script) collects what is needed from the case, formats it so the external application can use it (perhaps writing it to disk), triggers the external Python application, collects its results, and updates the case data based on those results.

You can use the following code to call a command line application from the Nuix Workstation scripting console:

import os
from subprocess import Popen, PIPE

python_project_path = r'C:\Projects\Python\Python-On-The-Engine'
predict_script = r'cli\predict_from_folder.py'
path_to_images = r'C:\Projects\RestData\Exports\temp'

python_script = os.path.join(python_project_path, predict_script)
cmd_args = ['python.exe', python_script, path_to_images]
prediction_process = Popen(cmd_args, stdout=PIPE,
                           universal_newlines=True, shell=True)

For these examples, please refer to the GitHub repository for the full code. It has several Python packages, of which we will use two here. The first is the img_classifier package, which holds a basic machine learning image classification application. Its environment requires CPython, so it cannot run inside the Nuix Engine (the GitHub repository’s ReadMe goes into more detail). In this example, we choose to run it from a command line.

The second package used in this example is the cli package in the repository. It has two files: cli.predict_selected.py is the in-Engine Python code we will use for demonstration here, and cli.predict_from_folder.py is a command line access application for the package. Because the latter is an external application, we will discuss what it does but not show its code.

The in-Engine code shown above uses the standard Python library’s subprocess package to open a subprocess to run the command line application. Specifically, it uses the Popen class to create a new process. We can follow the state of the process by reading its stdout, which we have access to because we piped it (stdout=PIPE) to our Python application.

return_code = None
while return_code is None:
    return_code = prediction_process.poll()
    if return_code is None:
        output = prediction_process.stdout.readline()
        print(output)
    else:
        print('Return Code: ' + str(return_code))
        output = prediction_process.stdout.readlines()
        print(output)

NOTE: It’s best practice to read both output streams (stdout and stderr) from the external process and to use a more structured format for the results. If the external application writes to an output stream and nothing reads it, the stream’s buffer can fill up and cause the application to hang.
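One way to drain both streams is Popen.communicate(), which reads stdout and stderr together until the process exits. The sketch below uses a tiny stand-in child process, not the repository’s predict_from_folder.py, so it runs anywhere; child_code and child_cmd are illustrative names.

```python
import subprocess
import sys

# Stand-in for the external application: a child process that writes
# to both stdout and stderr. Replace child_cmd with your real command line.
child_code = "import sys; print('predicted 3 images'); sys.stderr.write('1 warning\\n')"
child_cmd = [sys.executable, '-c', child_code]

process = subprocess.Popen(
    child_cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,
)

# communicate() reads both pipes until the child exits, so neither
# pipe can fill up and block the child.
stdout_text, stderr_text = process.communicate()

print('Return Code: ' + str(process.returncode))
print('stdout: ' + stdout_text.strip())
print('stderr: ' + stderr_text.strip())
```

The trade-off is that communicate() returns only once the process has finished and holds all output in memory, so it gives no live progress.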

Reading the output like this gets the results of any work the external application creates, but it isn’t very structured. Another method would be to have the external application write results out to a file (or files), then monitor those files in the in-Engine script for progress and results. The example below uses a JSON file to transport results. The external application will write the JSON file in the format:

{
  "status": {
    "done": <True|False>,
    "progress": <%complete>,
    "current_item": <index of current item>,
    "total": <count of total items>,
    "errors": []
  },
  "results": {
    "<item GUID>": [{"<classification1>": <score1>}, ...]
    ...
  }
}

While the application runs, it updates the status in the JSON and when it’s complete, it writes out the results. So our in-Engine code has to read the status to learn progress, then use the results section to get the data it needs to do work on the case items:

import json
import time
results_poll_time = # in seconds
results_json_filename = 'inference.json'

results_file = os.path.join(path_to_results, results_json_filename)
while not os.path.exists(results_file):
    # File not made yet, keep trying
    time.sleep(results_poll_time)
done = False

while not done:
    with open(results_file, 'r') as status_file:
        try:
            status = json.load(status_file)

            done = status['status']['done']
            if not done:
                print('Progress: ' + str(status['status']['progress']) + '% [' + 
                      str(status['status']['current_item']) +
                      '/' + str(status['status']['total']) + ']')
        except ValueError as err:
            # Don't care, this happens when the file is written at same time 
            # as reading, just skip and continue
            pass

    time.sleep(results_poll_time)
print('Finished prediction')

with open(results_file, 'r') as results_json:
    process_results_data = json.load(results_json)
    prediction_results = process_results_data['results']

    for item_guid, image_results in prediction_results.items():
        # Do something with the prediction results
        pass
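The ValueError handler in the polling loop is there because the reader can catch the file half-written. On the external side, one way to reduce that risk is to write each update to a temporary file and then swap it into place. This is a sketch under that assumption, not the repository’s writer; write_status_atomically is a hypothetical helper and it needs Python 3.3 or later for os.replace.

```python
import json
import os
import tempfile


def write_status_atomically(results_file, payload):
    """Write payload as JSON so readers do not see a half-written file."""
    folder = os.path.dirname(os.path.abspath(results_file))
    # Write to a temporary file in the same folder, then swap it into place.
    fd, temp_path = tempfile.mkstemp(dir=folder, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as temp_file:
            json.dump(payload, temp_file)
        # os.replace overwrites the target in one step on the same volume.
        os.replace(temp_path, results_file)
    except Exception:
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise


# Demonstration in a scratch folder with made-up status values.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, 'inference.json')
write_status_atomically(demo_file, {
    'status': {'done': False, 'progress': 50, 'current_item': 1,
               'total': 2, 'errors': []},
    'results': {},
})
with open(demo_file) as handle:
    print(json.load(handle)['status']['progress'])
```

On Windows the swap can raise PermissionError while the reader has the file open, so a real writer would retry, and the reader should keep its ValueError guard either way.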

Add a little clean-up code, comments, and testing and we would be just about done. There is a problem though: This relies on Python being installed and accessible from the command line – and specifically the correct Python environment to host our desired application being on the executable path, with all the required libraries available at execution time. This could be true if there is one Python environment but is a lot less likely if you have multiple Python environments, use Anaconda or Python virtual environments, or want to ship a specific and controlled Python environment to run on. To handle that, we would want to control the operating system’s environment variables to ensure the environment we want is the one that gets executed. To do that, we start the script with the following environment configuration:

python_env_path = r'C:\Projects\Python\Python-On-The-Engine\env'

def initialize_environment():
    python_path = python_project_path + ';' + python_env_path + ';' + \
        python_env_path + r'\Lib;' + python_env_path + r'\DLLs;' + \
        python_env_path + r'\Library\user\bin;' + \
        python_env_path + r'\Library\bin;' + python_env_path + r'\bin'
    path = python_path + ";" + os.getenv('PATH')

    os.environ['PYTHONPATH'] = python_path
    os.environ['PATH'] = path

initialize_environment()

Note: We put the python_path at the start of the PATH environment variable so our desired environment pre-empts any Python which may already be configured on the path.

The above configuration is a full environment configuration for an Anaconda environment installed in C:\Projects\Python\Python-On-The-Engine\env to run the external Python application. You can find the environment.yml file in the repository to copy the environment on your system.
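An alternative to changing os.environ for the whole in-Engine session is to build a separate environment dictionary and hand it to Popen through its env parameter, which affects only the child process. The sketch below assumes that approach; build_child_env is a hypothetical helper, and the child is a stand-in that simply prints the PYTHONPATH it received.

```python
import os
import subprocess
import sys


def build_child_env(extra_paths, python_path):
    """Return a copy of os.environ with PATH and PYTHONPATH adjusted."""
    child_env = dict(os.environ)
    # Prepend so the chosen environment wins over any Python already on PATH.
    child_env['PATH'] = os.pathsep.join(
        list(extra_paths) + [child_env.get('PATH', '')])
    child_env['PYTHONPATH'] = os.pathsep.join(python_path)
    return child_env


project_path = r'C:\Projects\Python\Python-On-The-Engine'
env_path = project_path + r'\env'
child_env = build_child_env(
    extra_paths=[env_path, env_path + r'\Library\bin'],
    python_path=[project_path, env_path],
)

# Stand-in child: report the PYTHONPATH it was started with.
child = subprocess.Popen(
    [sys.executable, '-c', "import os; print(os.environ['PYTHONPATH'])"],
    env=child_env,
    stdout=subprocess.PIPE,
    universal_newlines=True,
)
child_output, _ = child.communicate()
print(child_output.strip())
```

Using os.pathsep instead of a hard-coded ';' keeps the same helper usable on Linux or macOS.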

For a complete listing of the in-Engine code used in this example see the cli.predict_selected.py file in the GitHub repository, and the repository’s ReadMe for details on how to run it. Essentially the instructions to run are:

  • Select some JPEG files in an open case in Nuix Workstation
  • Copy the contents of cli.predict_selected.py into the Scripting Console
  • Modify the paths to match your system
  • Press Execute

Conceptually, this is a simple approach, and the external Python application can run with little or no changes from how it might naturally run without the Nuix Engine involved. But it gets a bit more complicated on the in-Engine side of things: we need to ensure we use the correct Python environment to run the application and we need to know how to monitor the application, parse the results and align it with the items that need to be worked on.

 

Microservice

An alternative approach to calling an external application from the Nuix Engine would be to connect to a running instance of the Python application, for example using sockets or shared memory. There are benefits here: you can start the Python application before connecting to it from the Nuix Engine, which removes the responsibility for configuring the environment from the in-Engine script. You can also keep the application running, which avoids repeatedly paying the start-up cost of initializing the external Python environment.

In the command line example, we processed all the images at once to minimize the number of times the environment had to start. One of the costs of that approach is having to parse through a results file to get the relevant data for each item. If we could instead:

  • keep the application running
  • pass one item in at a time
  • have the external application work on that one item then return a response,

we would know exactly which item the results belong to – we wouldn’t have to search for it, or sift through a large results stack to get the correct output.

A common way to approach having an independent long-running application that you want to interact with is to use a microservice and interact with that service using a REST-like API.

This brings us to our second example, which you can find in the microservice package: microservice.predict_selected.py and microservice.predict_service.py files. Like our command line example, we use the img_classifier package to run a simple image classification. This time we use the microservice.predict_service.py module to run it via a Flask application. (You can read the repository’s ReadMe to learn how to run it and read the code to see how it works.) For this article, all you need to know is that it exposes a REST-like interface with a /predict/{item GUID} endpoint we will use to predict on a specific image.

The application design is to use an in-Engine script to:

  • Loop through all the images one at a time
  • Make the POST request to the correct endpoint
  • Send the image via a Multipart Form file upload
  • Read the results back from the response to directly store in the image’s metadata.

In this example, we will see how to use the in-Engine script to communicate with the Flask Python application using Apache’s HttpClient to make the REST calls. You shouldn’t need to install anything new: the HttpClient Java library is already shipped in the Nuix environment and the in-Engine script has full access to the Nuix Java environment.

We will start laying down a framework to make the rest of the application easier to use. At the heart of a REST client is making requests to the server. The base-level request in our application is handled in this code:

import json
from java.nio.charset import Charset
from org.apache.http.impl.client import HttpClients
from org.apache.http.client.methods import RequestBuilder
from org.apache.http.util import EntityUtils

utf8 = Charset.forName('UTF-8')

def do_request(http_request):
    http_client = HttpClients.createDefault()

    try:
        response = http_client.execute(http_request)

        status_code = response.getStatusLine().getStatusCode()
        response_body = EntityUtils.toString(response.getEntity(), utf8)

        if status_code < 300 and response_body is not None:
            return status_code, json.loads(response_body)
        else:
            return status_code, response_body
    finally:
        http_client.close()

The parameter to do_request is an org.apache.http.client.methods.HttpUriRequest object. After the request it reads the response, returning the status code and a dictionary created from the response body’s JSON. We use get(...) and post(...) methods to generate those HttpUriRequest objects and then call this method to execute the request. Since the get(...) and post(...) methods do little more than formatting the URL and re-constructing the return types I won’t show the code; you can see it in the GitHub repository.
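To give a feel for the URL-formatting half of those helpers, here is a rough sketch of how a host and a list of path segments could be joined. It is not the repository’s code: build_url is a hypothetical name, and the real get(...) and post(...) may differ.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2.7 / Jython


def build_url(host, path_parts):
    """Join a host and a list of path segments into a request URL."""
    # Percent-encode each segment so odd characters cannot break the path.
    encoded = [quote(str(part), safe='') for part in path_parts]
    return host.rstrip('/') + '/' + '/'.join(encoded)


print(build_url('http://127.0.0.1:8982', ['predict', 'example-guid']))
# http://127.0.0.1:8982/predict/example-guid
```

Encoding each segment separately means a value containing a slash or a space stays inside its own segment.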

The way the in-Engine script works is to loop through all the selected items and then run the prediction for each one individually. The prediction looks like this:

from org.apache.http.entity.mime import MultipartEntityBuilder, HttpMultipartMode
from org.apache.http.entity import ContentType
HOST = 'http://127.0.0.1:8982'

item_guid = item.getGuid()
item_filename = item.getLocalisedName()

image_data = item.getBinary().getBinaryData().getInputStream().readAllBytes()

request_body = MultipartEntityBuilder.create() \
    .setMode(HttpMultipartMode.BROWSER_COMPATIBLE) \
    .addBinaryBody(item_guid, image_data, ContentType.DEFAULT_BINARY, item_filename) \
    .build()
success, response = post(HOST, ['predict', item_guid], body=request_body)

We get some data we need to make the request – such as the item’s GUID and name. We read the item’s binary data, then build a MultipartEntity – the HttpClient class for a Multipart Form file upload – and use that to make a POST request to the microservice’s /predict endpoint. The response will have the prediction results for the image, so we can go straight to parsing and setting the results to the item’s custom metadata:

predictions = response['results'][item.getGuid()]
prediction_data = ';'.join([list(pred.items())[0][0] + ':' +
                            str(round(float(list(pred.items())[0][1]) * 100, 2)) + '%'
                            for pred in predictions]
                           )
item_custom_metadata = item.getCustomMetadata()
item_custom_metadata['image_classifier_top3'] = prediction_data
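To see what that formatting produces, here is the same logic unrolled into a loop and run against a made-up response. The GUID, labels and scores are invented for illustration; only the shape follows the results format described earlier.

```python
# Sample response shaped like the "results" section of the service output.
item_guid = 'example-guid'
response = {
    'results': {
        item_guid: [{'tabby': 0.5}, {'tiger_cat': 0.25}, {'lynx': 0.125}]
    }
}

predictions = response['results'][item_guid]
formatted = []
for pred in predictions:
    # Each prediction is a single-entry dictionary: {label: score}.
    label, score = list(pred.items())[0]
    formatted.append(label + ':' + str(round(float(score) * 100, 2)) + '%')

prediction_data = ';'.join(formatted)
print(prediction_data)  # tabby:50.0%;tiger_cat:25.0%;lynx:12.5%
```

The resulting string is what ends up in the image_classifier_top3 custom metadata field.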

That is most of the code you need! There is some code for formatting URLs, looping through items, and limiting to just JPEGs, but not much else. There is a lot less configuration needed – just the HOST variable to point to the address and port where the microservice runs. The real complexity is pushed off to how to launch the microservice. Running a Flask application isn’t necessarily tough – read the repository’s ReadMe for details on how to configure and run it. Before going into any sort of production environment you should read more about how Flask applications are configured to ensure it is deployed securely for your environment.

 

Using the Java API

It is relatively easy and common to run Python code inside a Java virtual machine – Nuix takes advantage of Jython to do that, and a large part of Part 1 of this series used this feature. But you can also do the opposite: run Java application code in your Python code base. There are several tools available to do this, but for this explanation, I will use the pyjnius module. You can find pyjnius at kivy/pyjnius: Access Java classes from Python (github.com) and install it on Python 3 using pip.

The example presented here will be a sort of “hello world” for the Nuix Engine – starting the engine and getting a license. To understand how to use the Nuix Engine’s Java API you should read the Java Docs found online here: Java API (Engine API 9.6.10) (nuix.com) (and if you have the Nuix Engine installed, the docs are also found locally in the engine’s docs subfolder). You can find the code for this part of the post in the engine_in package in the provided GitHub repository and its usage in the repository’s ReadMe.

Before we use pyjnius, we need to configure the Java environment it will use to access the Nuix Engine’s Java code. This needs to be a JRE or JDK compatible with the Nuix Engine, so I suggest using the JRE shipped with the Engine in the jre subdirectory:

import os
nuix_engine_path = r'C:\Projects\nuix-engine'
def initialize_environment():
    engine_bin = os.path.join(nuix_engine_path, 'bin')
    engine_lib = os.path.join(nuix_engine_path, 'lib', '*')
    engine_ssl = os.path.join(nuix_engine_path, 'lib', 'non-fips', '*')
    engine_jre = os.path.join(nuix_engine_path, 'jre')
    engine_jvm = os.path.join(engine_jre, 'bin', 'server')

    classpath = ';'.join(['.', engine_lib, engine_ssl])
    java_home = engine_jre
    path_update = ';'.join([java_home, engine_jvm, engine_bin])

    os.environ['JAVA_HOME'] = java_home
    os.environ['CLASSPATH'] = classpath
    os.environ['PATH'] = f'{path_update};{os.environ["PATH"]}'

initialize_environment()

We also take this opportunity to ensure the libraries and other binaries the Nuix Engine needs are on the PATH and accessible to the executing environment. With that, we can start to create an instance of the Nuix Engine. Since the necessary classes are Java classes, we will use the autoclass function from pyjnius to import them into Python:

from jnius import autoclass

NUIX_USER = 'Inspector Gadget'
USER_DATA_DIR = r'C:\Projects\RestData'

GlobalContainerFactory = autoclass('nuix.engine.GlobalContainerFactory')
Collectors = autoclass('java.util.stream.Collectors')
global_container = GlobalContainerFactory.newContainer()
try:
    configs = dict_to_immutablemap({'user': NUIX_USER, 'userDataDirs': USER_DATA_DIR})
    engine = global_container.newEngine(configs)
finally:
    global_container.close()

I haven’t shown a utility method dict_to_immutablemap(…) which is used to convert Python dictionaries to an immutable implementation of java.util.Map. We now have an instance of an Engine, but it isn’t licensed yet. To get the license, we use the following code before the finally block above:

engine.whenAskedForCredentials(PCredentialsCallback())
license_config = dict_to_immutablemap({'sources': [LICENSE_SOURCE_TYPE]})
worker_config = dict_to_immutablemap({'workerCount': WORKER_COUNT})
found_licenses = engine.getLicensor()\
    .findLicenceSourcesStream(license_config) \
    .filter(PLicenseSourcePredicate()) \
    .collect(Collectors.toList())
for license_source in found_licenses:
    print(f'{license_source.getType()}: {license_source.getLocation()}')
    for available_license in license_source.findAvailableLicences():
        license_short_name = available_license.getShortName()
        if LICENSE_TYPE == license_short_name:
            available_license.acquire(worker_config)
            print(f'Acquired {license_short_name} from [{license_source.getType()}] '
                  f'{license_source.getLocation()}')
            break  # leaves the inner loop only; the outer loop moves on to the next source

This code relies on some constants – which I’m skipping for brevity – and some checking on available worker counts to be safe. It also requires two callbacks that must implement Java interfaces. The code below shows how to implement the Java interfaces in Python:

from jnius import PythonJavaClass, java_method
class PCredentialsCallback(PythonJavaClass):
    __javainterfaces__ = ['nuix/engine/CredentialsCallback']

    @java_method('(Lnuix/engine/CredentialsCallbackInfo;)V')
    def execute(self, info):
        print('Credentials Callback Called')
        info.setUsername(os.environ['nuix_user'])
        info.setPassword(os.environ['nuix_password'])

class PLicenseSourcePredicate(PythonJavaClass):
    __javainterfaces__ = ['java/util/function/Predicate']

    @java_method('(Ljava/lang/Object;)Z')
    def test(self, licence_source):
        print('License Test Called')
        return LICENSE_SOURCE_LOCATION == licence_source.getLocation()

These are Python objects that implement Java interfaces. The first can be used as a nuix.engine.CredentialsCallback to provide the credentials to log into the cloud server, while the second is a java.util.function.Predicate to filter down to license sources that connect to the desired location. Again, I have omitted some variable definitions here for code brevity.

That’s basically it, except as noted where I skipped code for brevity. With that example, you would be able to create a new instance of a Nuix Engine, claim a license, and be ready to use it in your Python application. See engine_in.grab_license.py in the GitHub repository linked above for the full code. The code was created in Python 3.9 in an environment you can reconstruct using Anaconda with the environment.yml file provided in the repository.

 

Nuix RESTful Service

Our final approach for using Python with the Nuix Engine is to call on the REST API provided by the Nuix RESTful Service. The RESTful Service is a wrapper around the Nuix Engine that allows the engine to be up and running full-time and to allow applications to connect, claim licenses, and do work in the engine as needed. It allows you to share the same instance of the Nuix Engine and case files from multiple applications and computers. It also lets the Nuix Engine run on servers, clusters, and the cloud.

Accessing RESTful services from Python isn’t anything new – it’s standard practice. The main module we use is requests, which you can install in any Python environment with pip or using Anaconda. The environment.yml file in the code repository for this post includes all the necessary Python packages.

If you’re following along from Part 1 of this blog series, you might recognize this as the inverse of the Python microservice example: instead of using a REST interface from the Nuix Engine to call into an external Python service, we’re using Python to call the REST interface into the Nuix Engine running as a service.

The API defining the endpoints we’ll use for calling the Nuix RESTful Service is documented in the Nuix REST API Reference. The Nuix SDK site has many examples of how to use the interface.

For this example, we’ll use the code inside the restful package in the repository. It’s a complete example that will do a paged export of all items in a case, by first doing a paged search, tagging items on each page, and then exporting items in a particular tag or page. The example has several different modules, each of which we’ll describe in varying levels of detail here. To start, let’s look at restful.rest_base.py:

import requests
def post(url, headers, data):
    print("POST: " + url)
    response = requests.post(url, headers=headers, data=data)
    print(response.status_code)
    try:
        response_body = response.json()
    except ValueError:  # body was not JSON; hand back the raw response
        response_body = response
    return response.status_code, response_body

def get(url, headers):
    print("GET: " + url)
    response = requests.get(url, headers=headers)
    print(response.status_code)
    try:
        response_body = response.json()
    except ValueError:  # body was not JSON; hand back the raw response
        response_body = response
    return response.status_code, response_body

restful.rest_base.py provides support for interacting with the RESTful API – it has methods for doing POST, GET, PUT, PATCH, HEAD, and DELETE requests to the service. The sample provided here shows GET and POST, as they provide the basic outline for all the others. They take in the full URL (with any query parameters), a dictionary for headers, and sometimes a dictionary for the body of the request. The methods will then make an appropriate requests method call and return a tuple containing the response status code and the parsed JSON body from the response as a dictionary.

Another bit of housekeeping is stored in the restful.nuix_api.py module. We use this to make it a little easier to build the request URLs for the endpoints. It has a class that holds every endpoint used in the example as a string, along with methods that replace the parameters in the endpoint paths with the values passed in. For example:

import json
class NuixRestApi:
    with open("config.json") as config_file:
        config = json.load(config_file)['rest']
    service = "nuix-restful-service/svc"

    case_count_path = "cases/{case_id}/count"
    @staticmethod
    def case_count_url(case_id):
        case_count_path = NuixRestApi.case_count_path.format(case_id=case_id)
        return f"{NuixRestApi.config['host']}:{NuixRestApi.config['port']}/" \
               f"{NuixRestApi.service}/{case_count_path}"

This lets us generate the URL for the endpoint to get the count of items in a case using count_endpoint = NuixRestApi.case_count_url(case_id). This helps isolate some of the configuration, such as the host, port and service path and build the full URL without having the configuration and URL building all over the code.
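The same idea can be generalized so one method serves every endpoint. The sketch below is a self-contained variation, not the repository’s class: ExampleRestApi and endpoint_url are hypothetical names, and the inline host and port stand in for values that would normally come from config.json (assuming the host entry includes the scheme).

```python
class ExampleRestApi:
    # Inline stand-in for the 'rest' section of config.json; values are assumed.
    config = {'host': 'http://localhost', 'port': 8080}
    service = 'nuix-restful-service/svc'
    case_count_path = 'cases/{case_id}/count'

    @staticmethod
    def endpoint_url(path_template, **params):
        """Fill the placeholders in path_template and prepend host, port and service."""
        path = path_template.format(**params)
        return '{host}:{port}/{service}/{path}'.format(
            host=ExampleRestApi.config['host'],
            port=ExampleRestApi.config['port'],
            service=ExampleRestApi.service,
            path=path)


count_endpoint = ExampleRestApi.endpoint_url(
    ExampleRestApi.case_count_path, case_id='abc123')
print(count_endpoint)
# http://localhost:8080/nuix-restful-service/svc/cases/abc123/count
```

A single generic method trades the readability of named methods such as case_count_url for less repetition, so which form is better depends on how many endpoints you use.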

The restful.nuix_api.py module also has a class that holds the various Content-Types used by the service, which makes it a little easier to select which version of an endpoint to use.

The final few bits of utility are in restful.nuix_utility.py, which provides methods for some common tasks on the RESTful service, such as logging in and out, doing a paged search, and monitoring async functions:

import os
import json
from nuix_api import NuixRestApi as nuix
from nuix_api import ContentTypes
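from rest_base import get, put, delete

# Assumption: reuse the 'rest' settings that NuixRestApi loaded from config.json
config = nuix.config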

def check_ready(headers):
    try:
        status_code, response_body = get(nuix.health_url(), headers)
        return status_code == 200
    except:
        return False

def login(headers):
    usr = os.environ['nuix_user']
    pw = os.environ['nuix_password']

    data = json.dumps({
        "username": usr,
        "password": pw,
        "licenseShortName": config["license"]["type"],
        "workers": config["license"]["workers"]
    })

    headers['Content-Type'] = ContentTypes.V1
    headers['Accept'] = ContentTypes.V1

    try:
        status_code, response_body = put(nuix.login_url(), headers, data)
        if status_code == 201:
            auth_token = response_body["authToken"]
            headers["nuix-auth-token"] = auth_token
            return True
        else:
            print(f"Unexpected return status code when Logging In: "
                  f"{status_code} [{response_body}]")
            return False
    finally:
        # Reset headers to default
        headers['Content-Type'] = ContentTypes.JSON
        headers['Accept'] = ContentTypes.JSON

def logout(headers):
    usr = os.environ['nuix_user']
    status_code, response_body = delete(nuix.logout_url(usr), headers, None)
    return status_code == 200, response_body

The example provided in the restful package of the repository linked above is a complete example that will find a case, get its item counts, tag items in bulk, and export them. For this blog let’s limit the scope to what we did with the previous example: getting a license. Given the groundwork we’ve already done, we can achieve that with this code:

import json
from nuix_api import ContentTypes
import nuix_utility as ute

with open("config.json") as config_file:
    config = json.load(config_file)['rest']

headers = {
    "Content-Type": ContentTypes.JSON,
    "Accept": ContentTypes.JSON
}

ok = ute.check_ready(headers)
if not ok:
    print('Server is not ready')
    exit(9)

ok = ute.login(headers)
if not ok:
    print("Failed to Log in.")
    exit(1)

try:
    # Congrats!  You've logged in.  Do your work here
    pass  # placeholder so the block is valid Python
finally:
    ute.logout(headers)

This, like some of the other code in this post, uses a JSON config file to store some settings. That config file is in the repository and contains things like the RESTful service’s host and port, configuration for the licensing, and settings you need for the tagging and export parts of the application, which aren’t shown here. You can find the full example in the restful.paged_export.py module, and the ReadMe explains how to use it.
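The log in, do work, always log out pattern from the script above can also be wrapped in a context manager so the pairing cannot be forgotten. This is a sketch, not something in the repository: service_session is a hypothetical name, and fake_login and fake_logout stand in for ute.login and ute.logout so the example runs without a server.

```python
from contextlib import contextmanager


@contextmanager
def service_session(headers, login, logout):
    """Log in, yield the headers for use, and always log out afterwards."""
    if not login(headers):
        raise RuntimeError('Failed to log in')
    try:
        yield headers
    finally:
        logout(headers)


# Stand-ins for ute.login and ute.logout; they only record the call order.
calls = []


def fake_login(headers):
    calls.append('login')
    headers['nuix-auth-token'] = 'example-token'
    return True


def fake_logout(headers):
    calls.append('logout')
    headers.pop('nuix-auth-token', None)
    return True, None


with service_session({'Accept': 'application/json'},
                     fake_login, fake_logout) as session_headers:
    calls.append('work')

print(calls)  # ['login', 'work', 'logout']
```

With the real helpers the call would look like with service_session(headers, ute.login, ute.logout) as headers: followed by your work.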

 

Additional Resources

The GitHub repository with the code used in this blog:

Other examples in GitHub:

Documentation hub: