.. code:: ipython3

    from seeq import spy
    import pandas as pd

    # Set the compatibility option so that you maximize the chance that SPy
    # will remain compatible with your notebook/script
    spy.options.compatibility = 192

.. code:: ipython3

    # Log in to the Seeq Server if you're not using Seeq Data Lab:
    spy.login(url='http://localhost:34216', credentials_file='../credentials.key', force=False)

spy.pull
========

Retrieves data for the specified items over a given time range.

Once you have searched for signals, conditions or scalars and have a
DataFrame with an ``ID`` column, you can execute ``spy.pull`` to
retrieve data for those items.

Alternatively, you can execute ``spy.pull('worksheet_url')`` if you want
to pull all the items loaded in an Analysis worksheet by supplying the
URL of the Seeq worksheet. You can copy the value of ``worksheet_url``
directly out of your browser's address bar without further modification.

Time interval
-------------

You must specify ``start`` and ``end``, which can be any data type or
format that is recognized by the Pandas function
``pandas.to_datetime()``.

Let's create a DataFrame of signals to retrieve data for:

.. code:: ipython3

    items = spy.search({
        'Path': 'Example >> Cooling Tower 1 >> Area A',
        'Name': 'Compressor'
    })
    items

Now we will retrieve data for those signals over a specified time
interval:

.. code:: ipython3

    spy.pull(items, start='2019-01-01', end='2019-01-07')

The index of the returned DataFrame consists of ``pd.Timestamp``
objects in the timezone that is set for the workstation or server
running Jupyter. Each signal is in a separate column.

Alternatively, you can request a specific timezone for both the input
and output (and they can be different):

.. code:: ipython3

    spy.pull(items,
             start=pd.Timestamp('2019-01-01', tz='US/Pacific'),
             end=pd.Timestamp('2019-01-07', tz='US/Pacific'),
             tz_convert='US/Eastern')

If you are pulling items from an Analysis worksheet URL, ``start`` and
``end`` default to the start and end of the Display Range in the Trend
view of the worksheet.

``worksheet_url = 'http://localhost:34216/workbook/<workbook_id>/worksheet/<worksheet_id>'``

``spy.pull(worksheet_url)``

However, you can always specify a different ``start`` or ``end`` when
executing ``spy.pull``:

``spy.pull(worksheet_url, start='2019-01-01', end='2019-01-07')``

Gridding
--------

By default, ``spy.pull`` will interpolate data onto a *grid* at
15-minute intervals. This is often desirable for matrix-style
computation, and Seeq goes to great lengths to ensure that
interpolation is done correctly.

You can change the spacing by specifying a ``grid`` parameter with
units of ``s``, ``min``, ``h`` or ``d``. For example, ``grid='1h'``
will cause samples to be spaced every hour. For dense data, specifying
a larger grid interval can significantly reduce the amount of data
transferred to Python, which can reduce the time taken to pull data.

You can also specify a timezone for alignment:
``grid="12h, 'Canada/Pacific'"``. To align samples to an offset,
include both a timezone and a date or timestamp formatted as
YYYY-MM-DD, YYYY-MM-DDTHH:MM:SS, or ISO 8601. For example, to align
samples every day at 7:17 AM and 7:17 PM in Vancouver, even across
daylight savings time changes:
``grid="12h,'Canada/Pacific','2020-06-01T07:17:00'"``.
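As a concrete illustration of these forms, here is a sketch of our own
(reusing the ``items`` DataFrame and date range from above; the
variable names are arbitrary):

.. code:: ipython3

    # Hourly samples: a coarser grid that transfers less data
    hourly = spy.pull(items, start='2019-01-01', end='2019-01-07', grid='1h')

    # Twice-daily samples aligned to the Canada/Pacific timezone
    aligned = spy.pull(items, start='2019-01-01', end='2019-01-07',
                       grid="12h, 'Canada/Pacific'")

    # Twice-daily samples aligned to 7:17 AM/PM in Vancouver
    offset = spy.pull(items, start='2019-01-01', end='2019-01-07',
                      grid="12h,'Canada/Pacific','2020-06-01T07:17:00'")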
Alternatively, specifying ``grid='auto'`` will automatically estimate a
grid value. The grid can be determined in either of two ways:

1. If the column ``Estimated Sample Period`` exists in the query
   DataFrame, the grid value will be the median of the sample periods
   in the query DataFrame. Note that in this case the sample period is
   determined for the time period specified when issuing
   ``spy.search(estimate_sample_period=dict(Start='', End=''))``.
2. If the column ``Estimated Sample Period`` does not exist in the
   query DataFrame, the grid will be determined by running the
   ``estimateSamplePeriod()`` Formula function on each signal in the
   query DataFrame for the time period given by the ``start`` and
   ``end`` parameters, and then taking the median of all the sample
   periods.

It is highly recommended to estimate the sample period using
``spy.search`` first and to validate the estimates, since the time
taken to pull data can increase significantly for dense data and
queries with a large number of signals.

All signals in Seeq honor their ``Maximum Interpolation`` property. If
the gap between two adjacent samples is larger than
``Maximum Interpolation``, then no interpolation is performed and you
may see ``NaN`` entries in the DataFrame, which indicate that there is
no appropriate value for that cell. This concept of
``Maximum Interpolation`` has both a functional purpose and performance
implications. It can prevent a process variable like a temperature
reading from "flatlining" inappropriately if no data is collected for a
period of time. It also constrains the amount of data that must be
retrieved from a source system in order to interpolate correctly to the
edges of a request interval.

Specifying ``grid=None`` retrieves the data with *raw* timestamps and
no interpolation. On rows where the signals don't all happen to have
values at a particular timestamp, you will see ``NaN`` entries.

**NaN vs None**: ``spy.pull()`` uses ``NaN`` to represent a *missing
value*, consistent with Pandas idioms. ``None``, on the other hand,
represents an *invalid* value, which in Seeq Workbench will break
interpolation if one or more invalid samples are located between two
valid samples.

Column headers
--------------

By default, column headers will contain the full path to the item (if
it's part of an asset tree) or just the name of the item (if it's not
in an asset tree). You can force the column headers to be populated
with a particular property using the ``header`` argument. In this
example, the column header will use just the ``Name`` property:

.. code:: ipython3

    spy.pull(items, start='2019-01-01T12:00:00', end='2019-01-01T13:00:00', header='Name')

Alternatively, you can supply a ``Header`` column in the ``items``
DataFrame whose values indicate the column headers you would like to be
used for each item that is pulled, as sketched below.
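For example, a minimal sketch of our own (the generated labels are
purely illustrative):

.. code:: ipython3

    # Assign a short, custom header to each row of the search results
    items_with_headers = items.copy()
    items_with_headers['Header'] = [f'Signal {i + 1}' for i in range(len(items_with_headers))]

    spy.pull(items_with_headers, start='2019-01-01T12:00:00', end='2019-01-01T13:00:00')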
Applying a Seeq Formula
-----------------------

You can easily leverage Seeq's time series calculation engine by adding
a ``calculation`` argument to ``spy.pull()``. For example, let's say we
want to calculate an hourly average and represent it as a
step-interpolated signal.

You can figure out what to put in the ``calculation`` argument by
experimenting in Seeq Workbench and going to the *Item Properties* pane
to look at the formula for a particular calculated item in Seeq. Then
you copy and paste that *Formula* property into your notebook, taking
care to use ``$signal``, ``$condition`` or ``$scalar`` (as appropriate)
for the variable name.

Now when we fetch the data, it will have the formula applied to the
output:

.. code:: ipython3

    signal_with_calc = spy.search({
        'Name': 'Area A_Temperature',
        'Datasource Name': 'Example Data'
    })

    spy.pull(signal_with_calc,
             start='2019-01-01T00:00:00',
             end='2019-01-01T03:00:00',
             calculation='$signal.aggregate(average(), hours(), startKey())',
             grid=None)

Alternatively, you can supply a ``Calculation`` column in the ``items``
DataFrame and specify the formula for each item independently. Here we
specify both a ``Calculation`` and a ``Header`` column to retrieve the
same signal with different formulas:

.. code:: ipython3

    area_a_rel_humidity = spy.search({
        'Path': 'Example >> Cooling Tower 1',
        'Asset': 'Area A',
        'Name': 'Relative Humidity',
        'Datasource Name': 'Example Data'
    })

    two_calcs = pd.concat([area_a_rel_humidity, area_a_rel_humidity]).reset_index()

    two_calcs['Calculation'] = pd.Series([
        '$signal.aggregate(average(), hours(), startKey())',
        '$signal.aggregate(average(), periods(2h), startKey())'
    ])

    two_calcs['Header'] = pd.Series([
        '1-hour Average',
        '2-hour Average'
    ])

    spy.pull(two_calcs, start='2019-01-01T00:00:00', end='2019-01-01T05:00:00', grid=None)

Applying a Seeq Formula Across Assets
-------------------------------------

If you have an asset tree (say, from OSIsoft Asset Framework), you can
apply a calculation across assets as long as those assets have the
necessary attributes (signals, conditions or scalars) with identical
names.

To get started, it's easiest to create the calculation in Seeq
Workbench using a single asset first. You can "swap" to other assets by
finding an asset in the Data tab and clicking the "swap" button.

In this example, we're going to use ``spy.push()`` to create a
calculated signal for use later.

.. code:: ipython3

    area_a_signals = spy.search({
        'Path': 'Example >> Cooling Tower 1 >> Area A'
    })

    dew_point_calc = spy.push(metadata=pd.DataFrame([{
        'Type': 'Signal',
        'Name': 'Dew Point',
        # From https://iridl.ldeo.columbia.edu/dochelp/QA/Basic/dewpoint.html
        'Formula': "$T - ((100 - $RH.setUnits(''))/5)",
        'Formula Parameters': {
            '$T': area_a_signals[area_a_signals['Name'] == 'Temperature'],
            '$RH': area_a_signals[area_a_signals['Name'] == 'Relative Humidity']
        }
    }]), workbook='SPy Documentation Examples >> spy.pull')

    dew_point_calc

Now you can search for all the assets you want to apply this
calculation to, and use the ``calculation`` argument in ``spy.pull()``
to retrieve data for the various assets.

.. code:: ipython3

    all_areas = spy.search({
        'Path': 'Example >> Cooling Tower 1'
    }, recursive=False)

    spy.pull(all_areas, calculation=dew_point_calc)
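If you want to sanity-check a calculation like this, one option is to
pull the raw inputs and recompute the formula locally in Pandas. The
following is a sketch of our own, reusing ``area_a_signals`` from above
with an illustrative date range:

.. code:: ipython3

    # Pull the two inputs of the dew point formula on a common hourly grid
    raw = spy.pull(
        area_a_signals[area_a_signals['Name'].isin(['Temperature', 'Relative Humidity'])],
        start='2019-01-01', end='2019-01-02', grid='1h', header='Name')

    # Recompute the same approximation locally: T - (100 - RH) / 5
    dew_point_local = raw['Temperature'] - (100 - raw['Relative Humidity']) / 5
    dew_point_local.head()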
Conditions
----------

Conditions capture time intervals of interest as *capsules*, which have
a start time and an end time. Capsules will not have an end time if
they are *uncertain* (aka *in progress*), and they may not have a start
or end time if it lies outside of the queried time interval (see below
for more information).

In order to illustrate how condition data can be pulled, we must first
push a simple condition into Seeq. (Don't worry: this condition will be
scoped to a workbook that only you can see.)

.. code:: ipython3

    compressor_on_high = spy.push(metadata=pd.DataFrame([{
        'Name': 'Compressor on High',
        'Type': 'Condition',
        'Formula Parameters': {
            # Note here that we are just using a row from our search results. The SPy module
            # will figure out that it contains an identifier that we can use.
            '$cp': items[items['Name'] == 'Compressor Power']
        },
        # This formula specifies capsules to be created when Compressor Power is above 25 kW
        'Formula': '$cp.valueSearch(isGreaterThan(25kW))'
    }]), workbook='SPy Documentation Examples >> spy.pull')

    compressor_on_high

We can now pull data for a particular time period for that condition:

.. code:: ipython3

    spy.pull(compressor_on_high, start='2019-01-01T04:00:00Z', end='2019-01-09T02:00:00Z')

If there were multiple conditions, the ``Condition`` field would show
the other conditions further down.

Notice that the first capsule (at index 0) has no start time. That's
because it starts outside the supplied time interval. Similarly, the
last capsule (at index 8) has no end time because it ends outside the
supplied time interval.

``Capsule Is Uncertain`` will be ``True`` for capsules whose exit
criteria have not yet been satisfied and may change as a result of new
data arriving in the system. In Seeq Workbench, such capsules are
"hollowed out" in the Gantt bars at the top of the trend.

If we want to know the start and end times even when they lie outside
our time interval, we can specify a ``calculation`` argument that
applies the ``removeLongerThan()`` function, which allows the Seeq
calculation engine to retrieve the start/end times. For example:

.. code:: ipython3

    spy.pull(compressor_on_high,
             start='2019-01-01T04:00:00Z',
             end='2019-01-09T02:00:00Z',
             calculation='$condition.removeLongerThan(1d)')

Now the first and last capsules have both starts and ends. It's
important to note that setting the *Maximum Duration* for a condition
to a large value can have major performance implications, because the
Seeq calculation engine may have to evaluate significant amounts of
data before/after the query interval to "find" the start/end of a
capsule.

Conditions as signals
~~~~~~~~~~~~~~~~~~~~~

You can also request that the condition data be represented as a signal
with a value of ``1`` if a capsule is present at a particular timestamp
and ``0`` if not. Use ``shape='samples'`` like so:

.. code:: ipython3

    spy.pull(compressor_on_high,
             start='2019-01-01T00:00:00Z',
             end='2019-01-01T12:00:00Z',
             shape='samples',
             grid='1h')

Mixing conditions with signals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the DataFrame that you pass into the ``spy.pull()`` function
contains a mix of signals and conditions, ``shape='samples'`` will be
applied automatically:

.. code:: ipython3

    compressor_signal_and_condition = pd.concat([spy.search({
        'Path': 'Example >> Cooling Tower 1 >> Area A',
        'Name': 'Compressor Power',
        'Datasource Name': 'Example Data'
    }), spy.search({
        'Name': 'Compressor*High',
    }, workbook='SPy Documentation Examples >> spy.pull')], sort=False, ignore_index=True)

    spy.pull(compressor_signal_and_condition,
             start='2019-01-01T00:00:00Z',
             end='2019-01-01T12:00:00Z',
             header='Name',
             grid='1h')
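Because the condition arrives as a ``0``/``1`` column alongside the
signal, ordinary Pandas filtering works on the result. A minimal sketch
of our own (assuming ``header='Name'`` produced the columns
``Compressor Power`` and ``Compressor on High``):

.. code:: ipython3

    mixed = spy.pull(compressor_signal_and_condition,
                     start='2019-01-01T00:00:00Z',
                     end='2019-01-01T12:00:00Z',
                     header='Name',
                     grid='1h')

    # Average compressor power over the gridded samples that fall inside a capsule
    mixed[mixed['Compressor on High'] == 1]['Compressor Power'].mean()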
You can supply ``shape='capsules'`` to force the output to capsules. In
that case, the signals within the DataFrame will be aggregated within
the interval of each capsule according to a statistic. By default, that
statistic is ``average``, but you can add a ``Statistic`` column to the
passed-in DataFrame to control which statistic is used. To return
multiple statistics for the same signal, add additional rows to your
DataFrame that vary only by the ``Statistic`` column. The possible
statistic function names you can specify are documented in Seeq
Workbench's Formula tool documentation under the heading *Signal Value
Statistics*.

Let's try this example:

.. code:: ipython3

    area_a_compressor_power = spy.search({
        'Path': 'Example >> Cooling Tower 1 >> Area A',
        'Name': 'Compressor Power',
        'Datasource Name': 'Example Data'
    }, quiet=True)

    compressor_on_high = spy.search({
        'Name': 'Compressor*High',
    }, workbook='SPy Documentation Examples >> spy.pull', quiet=True).iloc[0]

    area_a_compressor_power_max = area_a_compressor_power.iloc[0].copy()
    area_a_compressor_power_max['Statistic'] = 'maximum'

    area_a_compressor_power_delta = area_a_compressor_power.iloc[0].copy()
    area_a_compressor_power_delta['Statistic'] = 'totalized'

    conditions_with_stats = pd.DataFrame([
        compressor_on_high,
        area_a_compressor_power_max,
        area_a_compressor_power_delta
    ]).reset_index(drop=True)

    conditions_with_stats

Notice that the ``Compressor Power`` signal appears twice in this
DataFrame but with a different value in the ``Statistic`` column. If we
pull with this DataFrame and the ``shape='capsules'`` argument, we'll
get columns for these aggregates:

.. code:: ipython3

    pull_results = spy.pull(conditions_with_stats,
                            start='2019-01-01T00:00:00Z',
                            end='2019-01-07T00:00:00Z',
                            shape='capsules',
                            header='Name',
                            grid='1h')

    # For brevity, drop the uncertainty column
    pull_results.drop(columns=['Capsule Is Uncertain'])

Scalars
-------

Scalars represent a constant value across all time. Let's push one into
Seeq to see how it looks when we retrieve it:

.. code:: ipython3

    compressor_power_limit = spy.push(metadata=pd.DataFrame([{
        'Name': 'Compressor Power Limit',
        'Type': 'Scalar',
        'Formula': '50kW'
    }]), workbook='SPy Documentation Examples >> spy.pull', errors='raise')

    compressor_power_limit

.. code:: ipython3

    spy.pull(compressor_power_limit)

Pickling
--------

Sometimes a pull is "expensive", taking a long time to complete, while
the underlying data is unlikely to have changed. In that case, it is
common to "pickle" the resulting DataFrame for later use, like so:

.. code:: ipython3

    pull_results.to_pickle('pickled_pull.pkl')

Now you can "unpickle" the DataFrame and use it without incurring the
cost of re-executing the pull:

.. code:: ipython3

    unpickled_pull = pd.read_pickle('pickled_pull.pkl')

You can access some of the original context of the pull via the ``spy``
attribute of the unpickled DataFrame. For example, ``spy.status`` tells
you which queries succeeded and which failed:

.. code:: ipython3

    unpickled_pull.spy.status.df
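A common pattern (a sketch of our own, not a SPy feature) is to
re-execute the pull only when no pickle file exists yet:

.. code:: ipython3

    import os

    pickle_path = 'pickled_pull.pkl'  # hypothetical cache location

    if os.path.exists(pickle_path):
        # Reuse the cached DataFrame and skip the expensive pull
        cached_results = pd.read_pickle(pickle_path)
    else:
        cached_results = spy.pull(conditions_with_stats,
                                  start='2019-01-01T00:00:00Z',
                                  end='2019-01-07T00:00:00Z',
                                  shape='capsules', header='Name', grid='1h')
        cached_results.to_pickle(pickle_path)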
Performance Metrics
-------------------

If your pull operation takes a long time to complete, you can take a
look at the timing information for data processing by inspecting the
"Data Processed" cell in the status DataFrame, like so:

.. code:: ipython3

    # Pull two months of data
    pull_results = spy.pull(spy.search({'Name': 'Area B_Compressor Power'}),
                            start='2019-01-01', end='2019-03-01', grid=None)

.. code:: ipython3

    # Inspect the 'Data Processed' cell in the first row of the status DataFrame
    pull_results.spy.status.df.iloc[0]['Data Processed']

.. parsed-literal::

    Time spent in Request Queue: 0:00:00.000340
    Time spent reading Metadata: 0:00:00.001720
    Time spent waiting for Datasource(s): 0:00:00
    Time spent reading from Seeq Cache: 0:00:00
    Time spent in Calc Engine Queue: 0:00:00
    Time spent in Calc Engine: 0:00:00.038660
    Time spent reclaiming Memory: 0:00:00
    Metadata items read: 2
    Metadata relationships read: 0
    Samples read from Datasource(s): 0
    Samples read from Persistent Cache: 0
    Samples read from In-Memory Cache: 42482
    Capsules read from Datasource(s): 0
    Capsules read from Persistent Cache: 0
    Capsules read from In-Memory Cache: 0
    Total bytes processed: 679712

Note that the "Time spent" values here are inclusive of the time spent
within the Seeq service. They are *exclusive* of the time spent
transmitting the data from the Seeq service to SPy and the time spent
by SPy organizing the data into a Pandas DataFrame.

Detailed Help
-------------

All SPy functions have detailed documentation to help you use them.
Just execute ``help(spy.<function>)`` as you see below:

.. code:: ipython3

    help(spy.pull)

API Reference Links
-------------------

- :py:func:`seeq.spy.pull`