Adventures in Immutable Python

Arrow
- Documentation with Pandas
import pyarrow as pa
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)

# Convert back to pandas
df_new = table.to_pandas()

# Infer Arrow schema from pandas
schema = pa.Schema.from_pandas(df)
StaticFrame
Pandas
All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame. However, the vast majority of methods produce new objects and leave the input data untouched. In general we like to favor immutability where sensible.
import numpy as np
import pandas as pd

artwork_data = pd.read_csv("artwork_data.csv")

# drop() returns a new DataFrame; artwork_data itself keeps every column.
# With inplace=True, only id, artist, title, medium, and year would remain.
artwork_data.drop(columns=[
    "accession_number", "artistRole", "artistId", "dateText", "acquisitionYear",
    "dimensions", "width", "height", "depth", "creditLine", "units",
    "inscription", "thumbnailCopyright", "thumbnailUrl", "url",
])
Rationale
Immutable data structures reduce opportunities for error and promote the design of pure functions, offering programs that are easier to reason about and maintain. While Pandas is used in many domains where such benefits are highly desirable, there is no way to enforce immutability in Pandas.
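A minimal sketch of that argument in plain Python, with no Pandas or StaticFrame involved. The function names and the price data are made up for illustration. The mutating version gives a different result each time it is run on the same variable, while the pure version does not:

```python
def add_tax_in_place(prices, rate):
    """Impure: overwrites the caller's list."""
    for i, price in enumerate(prices):
        prices[i] = round(price * (1 + rate), 2)
    return prices


def add_tax(prices, rate):
    """Pure: builds and returns a new tuple, leaving the input alone."""
    return tuple(round(price * (1 + rate), 2) for price in prices)


mutable_prices = [10.0, 20.0]
add_tax_in_place(mutable_prices, 0.5)
add_tax_in_place(mutable_prices, 0.5)  # running it twice compounds the change

frozen_prices = (10.0, 20.0)
first = add_tax(frozen_prices, 0.5)
second = add_tax(frozen_prices, 0.5)   # same input, same output

print(mutable_prices)  # no longer the original data
print(first == second)
```

With the tuple, re-running a step cannot silently change the starting data, which is the property the rest of these notes look for in a DataFrame library.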
In Action
StaticFrame aspires to have comparable or better performance than Pandas. While this is already the case for some core operations (see Performance), some important functions are far more performant in Pandas (such as reading delimited text files via pd.read_csv).
import static_frame as sf

df = sf.Frame.from_pandas(artwork_data)
print(df.shape)
df.dtypes
StaticFrame interfaces for extracting data will be familiar to Pandas users, though with a number of interface refinements to remove redundancies and increase consistency. On a Frame, __getitem__ is (exclusively) a column selector; loc and iloc are (with one argument) row selectors or (with two arguments) row and column selectors.
df['artist': 'year'].tail()
Instead of in-place assignment, an assign interface object (similar to the Frame.astype interface) is provided to expose __getitem__, loc, and iloc interfaces that, when called with an argument, return a new object with the desired changes. These interfaces expose the full range of expressive assignment-like idioms found in Pandas and NumPy. Arguments can be single values, or Series and Frame objects, where assignment will align on the Index.
StaticFrame immutability:
def inc(x):
    x += 1
    return x

print("Original: " + str(df.loc[69196, 'acquisitionYear']))
# assign returns a new Frame; the result is discarded here, so df is unchanged.
df.assign.loc[69196, 'acquisitionYear'](inc(df.loc[69196, 'acquisitionYear']))
df.loc[69196, 'acquisitionYear']
Updating a StaticFrame structure requires creating a new one:
print("Original: " + str(df.loc[69196, 'acquisitionYear']))
# Bind the Frame returned by assign to a new name; df itself stays as it was.
df_updated = df.assign.loc[69196, 'acquisitionYear'](inc(df.loc[69196, 'acquisitionYear']))
df_updated.loc[69196, 'acquisitionYear']
Pandas mutability:
print("Original: " + str(artwork_data.loc[69196, 'acquisitionYear']))
artwork_data.at[69196, 'acquisitionYear'] = inc(artwork_data.at[69196, 'acquisitionYear'])
artwork_data.loc[69196, 'acquisitionYear']
When the cell is run again, the value of artwork_data.at[69196, 'acquisitionYear'] has already been mutated, so the "Original" value it prints is already incremented.
print("Original: " + str(artwork_data.loc[69196, 'acquisitionYear']))
artwork_data.at[69196, 'acquisitionYear'] = inc(artwork_data.at[69196, 'acquisitionYear'])
artwork_data.loc[69196, 'acquisitionYear']
Documentation
StaticFrame does not implement its own types or numeric computation routines, relying entirely on NumPy. NumPy offers desirable stability in performance and interface. For working with SciPy and related tools, StaticFrame exposes easy access to NumPy arrays.
The static_frame.Series and static_frame.Frame store data in immutable NumPy arrays. Once created, array values cannot be changed. StaticFrame manages NumPy arrays, setting the ndarray.flags.writeable attribute to False on all managed and returned NumPy arrays.
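A small sketch of what that flag does, using NumPy on its own. This only shows the NumPy mechanism the passage names and makes no claim about how StaticFrame applies it internally. The variable names are made up for illustration:

```python
import numpy as np

values = np.array([1, 2, 3])
values.flags.writeable = False  # the flag the passage refers to

try:
    values[0] = 99
    write_error = None
except ValueError as exc:
    write_error = exc  # NumPy refuses writes to a read-only array

# A view of a read-only array is read-only as well.
view = values[:2]

# Deriving new data still works, because it allocates a new array.
doubled = values * 2

print(type(write_error).__name__)
print(view.flags.writeable, doubled.flags.writeable)
```

The original buffer is protected, but any computation that produces a fresh array is unaffected.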
Hand Spun
Official Docs
- Objects, Values, and Types: lists mutable and immutable types
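A quick sketch of the distinction that page draws, using only built-in types. The variable names are made up for illustration:

```python
numbers = [1, 2]
alias = numbers          # a second name for the same list object
alias.append(3)          # in-place change, visible through both names

point = (1, 2)
try:
    point[0] = 10        # tuples do not support item assignment
    tuple_error = None
except TypeError as exc:
    tuple_error = exc

# An immutable container can still hold a mutable object: the tuple keeps
# referring to the same list, but that list's contents can change.
record = ("scores", [90])
record[1].append(95)

print(numbers)
print(type(tuple_error).__name__)
print(record)
```

The last case is worth remembering: immutability of a container says nothing about the objects it refers to.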
I would argue that it is better style to pass that variable in as an argument to the function, or create a class that contains that variable and the function. Using globals in Python is usually a bad idea.
via Joop
import pandas as pd

class Bla(object):
    def __init__(self):
        self._df = pd.DataFrame(index=[1, 2, 3])

    def df(self):
        # Return a copy so callers cannot modify the internal DataFrame.
        return self._df.copy()
a = [0, 1]
a.append(2)      # mutates a in place
print(a)
a_new = a + [3]  # builds a new list; a is unchanged
print(a)
a_new
import pandas as pd

test_s = pd.Series([1, 2, 3])
print("1st: %s %s Length: %s" % (id(test_s), id(test_s.array), len(test_s)))
test_s[3] = 37
print("2nd: %s %s Length: %s" % (id(test_s), id(test_s.array), len(test_s)))
Appending and deleting are allowed, but that doesn't necessarily imply the Series is mutable.
Series/DataFrames are internally represented by NumPy arrays, which are fixed-size (their length cannot be changed in place) to allow a more compact memory representation and better performance.
When you assign to a Series, you're actually calling Series.__setitem__ (which, for a label that does not exist yet, delegates to the loc indexer), which creates a new array. This new array is then assigned back to the same Series (of course, as the end user, you don't get to see this), giving you the illusion of mutability.
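The "new array assigned back to the same object" pattern can be sketched with plain NumPy. GrowableValues is a hypothetical wrapper written for this note, not a pandas class, and it is not how pandas is implemented. It only shows how a fixed-size array can sit behind an object that appears to grow:

```python
import numpy as np


class GrowableValues:
    """Hypothetical wrapper that fakes 'append' on a fixed-size array."""

    def __init__(self, data):
        self._values = np.asarray(data)

    def append(self, value):
        # np.append never resizes in place; it returns a newly allocated array,
        # which is then rebound to the same attribute.
        self._values = np.append(self._values, value)

    def __len__(self):
        return len(self._values)


box = GrowableValues([1, 2, 3])
old_buffer = box._values
box_id = id(box)

box.append(37)

same_wrapper = id(box) == box_id            # the outer object is unchanged
new_buffer = box._values is not old_buffer  # but it now holds a different array

print(same_wrapper, new_buffer, len(old_buffer), len(box))
```

From the outside the wrapper looks like it grew, while the original array still has its original length.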