David Schmudde / Oct 15 2020

Adventures in Immutable Python

Arrow

import pyarrow as pa
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)
# Convert back to pandas
df_new = table.to_pandas()

# Infer Arrow schema from pandas
schema = pa.Schema.from_pandas(df)

StaticFrame

Pandas

All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame. However, the vast majority of methods produce new objects and leave the input data untouched. In general we like to favor immutability where sensible.
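As a small illustration of the distinction drawn above, the sketch below uses a made-up two-column frame (the column names `a`, `b`, and `c` are placeholders, not part of the artwork data). It shows a method that returns a new object next to a column insertion that changes the frame itself.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# drop() returns a new DataFrame and leaves df untouched
trimmed = df.drop(columns=["b"])
original_columns = list(df.columns)
trimmed_columns = list(trimmed.columns)
print(original_columns)
print(trimmed_columns)

# Inserting a column, by contrast, changes df itself
df["c"] = [7, 8, 9]
print(list(df.columns))
```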

artwork_data.csv
import numpy as np
import pandas as pd

artwork_data = pd.read_csv("artwork_data.csv")

# drop() returns a new DataFrame; without inplace=True, artwork_data itself is unchanged.
# The returned frame keeps only id, artist, title, medium, and year.
artwork_data.drop(columns=["accession_number", "artistRole", "artistId", "dateText", "acquisitionYear", "dimensions", "width", "height", "depth", "creditLine", "units", "inscription", "thumbnailCopyright", "thumbnailUrl", "url"])

Rationale

Immutable data structures reduce opportunities for error and promote the design of pure functions, offering programs that are easier to reason about and maintain. While Pandas is used in many domains where such benefits are highly desirable, there is no way to enforce immutability in Pandas.
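To make the link between immutability and pure functions concrete, here is a minimal sketch using only the standard library. The `Artwork` record and `with_next_year` function are hypothetical names invented for this illustration; they are not part of Pandas or StaticFrame.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Artwork:
    """Hypothetical immutable record used only for this illustration."""
    title: str
    acquisition_year: int


def with_next_year(artwork: Artwork) -> Artwork:
    """Pure function: returns a new record instead of modifying its input."""
    return replace(artwork, acquisition_year=artwork.acquisition_year + 1)


original = Artwork(title="Untitled", acquisition_year=2013)
updated = with_next_year(original)
print(original.acquisition_year, updated.acquisition_year)

# Direct mutation of a frozen dataclass is rejected at runtime
try:
    original.acquisition_year = 2020
except FrozenInstanceError as err:
    mutation_blocked = True
    print("blocked:", type(err).__name__)
else:
    mutation_blocked = False
```

Because the input can never change, calling `with_next_year` twice on `original` gives the same answer both times.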

In Action

StaticFrame aspires to have comparable or better performance than Pandas. While this is already the case for some core operations (See Performance), some important functions are far more performant in Pandas (such as reading delimited text files via pd.read_csv).

import static_frame as sf

df = sf.Frame.from_pandas(artwork_data)

print(df.shape)
df.dtypes

StaticFrame interfaces for extracting data will be familiar to Pandas users, though with a number of interface refinements to remove redundancies and increase consistency. On a Frame, __getitem__ is (exclusively) a column selector; loc and iloc are (with one argument) row selectors or (with two arguments) row and column selectors.

df['artist': 'year'].tail()

Instead of in-place assignment, an assign interface object (similar to the Frame.astype interface) is provided to expose __getitem__, loc, and iloc interfaces that, when called with an argument, return a new object with the desired changes. These interfaces expose the full range of expressive assignment-like idioms found in Pandas and NumPy. Arguments can be single values, or Series and Frame objects, where assignment will align on the Index.

StaticFrame immutability:

def inc(x):
  x+=1
  return x

print("Original: " + str(df.loc[69196, 'acquisitionYear']))
df.assign.loc[69196, 'acquisitionYear'](inc(df.loc[69196, 'acquisitionYear']))
df.loc[69196, 'acquisitionYear']
2013.0

Updating a StaticFrame structure requires creating a new one:

print("Original: " + str(df.loc[69196, 'acquisitionYear']))
df_updated = df.assign.loc[69196, 'acquisitionYear'](inc(df.loc[69196, 'acquisitionYear']))
df_updated.loc[69196, 'acquisitionYear']
2014.0

Pandas mutability:

print("Original: " + str(artwork_data.loc[69196, 'acquisitionYear']))
artwork_data.at[69196, 'acquisitionYear'] = inc(artwork_data.at[69196, 'acquisitionYear'])
artwork_data.loc[69196, 'acquisitionYear']
2014.0

When the cell is run again, the value of artwork_data.at[69196, 'acquisitionYear'] has already been mutated.

print("Original: " + str(artwork_data.loc[69196, 'acquisitionYear']))
artwork_data.at[69196, 'acquisitionYear'] = inc(artwork_data.at[69196, 'acquisitionYear'])
artwork_data.loc[69196, 'acquisitionYear']
2015.0

Documentation

StaticFrame does not implement its own types or numeric computation routines, relying entirely on NumPy. NumPy offers desirable stability in performance and interface. For working with SciPy and related tools, StaticFrame exposes easy access to NumPy arrays.

The static_frame.Series and static_frame.Frame store data in immutable NumPy arrays. Once created, array values cannot be changed. StaticFrame manages NumPy arrays, setting the ndarray.flags.writeable attribute to False on all managed and returned NumPy arrays.
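The effect of the writeable flag can be seen with plain NumPy, independent of StaticFrame. The sketch below sets the flag by hand on a small made-up array; it only illustrates the NumPy mechanism described above and is not StaticFrame's own code.

```python
import numpy as np

values = np.array([2013.0, 2014.0, 2015.0])
values.flags.writeable = False  # mark the array as read-only

# Any in-place write now raises ValueError
try:
    values[0] = 1999.0
except ValueError as err:
    write_error = str(err)
    print("write rejected:", write_error)
else:
    write_error = None

# Changed values have to live in a new array
updated = values + 1
print(values[0], updated[0])
```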

Hand Spun

Official Docs

I would argue that it is better style to pass that variable in as an argument to the function, or create a class that contains that variable and the function. Using globals in Python is usually a bad idea.

via Joop

class Bla(object):
    def __init__(self):
        self._df = pd.DataFrame(index=[1,2,3])

    @property
    def df(self):
        return self._df.copy()
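A quick way to see what the copy-returning property buys is to modify the returned frame and check the stored one. The sketch below re-creates the same pattern under a hypothetical class name, `FrameHolder`, with a made-up column `x`.

```python
import pandas as pd


class FrameHolder:
    """Hypothetical wrapper: hands out copies so the stored frame stays intact."""

    def __init__(self):
        self._df = pd.DataFrame({"x": [1, 2, 3]})

    @property
    def df(self):
        # copy() is deep by default, so callers get independent data
        return self._df.copy()


holder = FrameHolder()
borrowed = holder.df
borrowed.loc[0, "x"] = 100  # changes only the caller's copy

print(borrowed.loc[0, "x"])
print(holder.df.loc[0, "x"])
```

The trade-off is that every access pays for a full copy, and nothing stops code inside the class from mutating `_df` directly.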

a = [0, 1]
a.append(2)  # mutates the list in place
print(a)
a_new = a + [3]  # builds a new list; a is unchanged
print(a)
a_new
[0, 1, 2, 3]

import pandas as pd
test_s = pd.Series([1,2,3])
print("1st: %s %s Length: %s" % (id(test_s), id(test_s.array), len(test_s)))
test_s[3] = 37
print("2nd: %s %s Length: %s" % (id(test_s), id(test_s.array), len(test_s)))

Appending and deleting are allowed, but that doesn't necessarily imply the Series is mutable.

Series and DataFrames are internally backed by NumPy arrays, which are fixed-size (their values can be overwritten, but they cannot grow in place), allowing a more compact memory representation and better performance.

When you assign a new label to a Series, you're actually calling Series.__setitem__, which falls back to the loc indexer and builds a new, larger array. This new array is then assigned back to the same Series (of course, as the end user, you don't get to see this), giving you the illusion that the Series grew in place.
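The behaviour described above can be checked directly. This is a minimal sketch that compares the Series object and its underlying NumPy data before and after adding a new label; the variable names are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
series_id = id(s)
before = s.to_numpy()  # NumPy data backing the 3-element Series

s[3] = 37  # label 3 is not in the index, so the Series is enlarged

after = s.to_numpy()
same_object = id(s) == series_id
shares_buffer = np.shares_memory(before, after)

print("same Series object:", same_object)
print("lengths:", len(before), len(after))
print("shares memory with old data:", shares_buffer)
```

The Series keeps its identity, while the four-element data lives in a different buffer from the original three-element array.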