Writing and reading Python objects – the pickle module

We’ve seen how to write and read text files. This is fine if you want to store the data in a human-readable form or transfer it to another system. However, the problem with writing data as text is that all the internal structure of the data, in the form of the Python objects that store it, is lost. If we want to write a Python list to a file and read it in later, we would need to write code to transform the text that we read in back into a Python list object before we could use it.

The pickle module provides a way to save Python objects in binary form that preserves their class structure, so that they can be read in later and used as Python objects without any modification. To illustrate, we use the example from the earlier post in which we generate a deck of cards as a list of tuples. Here’s how we would write the deck and read it back in using pickle:

import pickle

def deck():
    """ Generate a full deck of 52 cards as a list of tuples """
    suits = ['Spades', 'Hearts', 'Diamonds', 'Clubs']
    rank = ['Ace'] + list(range(2,11)) + ['Jack', 'Queen', 'King']
    cards = []
    for suit in suits:
        for spot in rank:
            cards += [(suit, spot)]
    return cards

cardPickleFile = open('Cards.pkl', 'wb')
pickle.dump(deck(), cardPickleFile)
cardPickleFile.close()

cardPickleFile = open('Cards.pkl', 'rb')
cards = pickle.load(cardPickleFile)
cardPickleFile.close()
print(cards)
print(type(cards), type(cards[0]))

cardPickleString = pickle.dumps(deck())
print(cardPickleString)

cards = pickle.loads(cardPickleString)
print(cards)
print(type(cards), type(cards[0]))

The deck() function is the same one we’ve used in the past for generating a list of tuples, with each tuple representing a card in the deck. The cards object returned on line 11 is the complete list of 52 cards.

To save this using pickle, we open a file for writing on line 13. Note that the mode of the file is given as 'wb', where the ‘b’ means that the file will store binary data, as opposed to text. This is because pickle writes the data in a binary format that is not human-readable and is specific to Python, so it is not intended to be read by programs written in other languages. There are also several versions of the format (known as protocols) that pickle has used over the years, and data written using a more recent protocol may not be readable by an older version of Python.
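As a rough illustration of the protocol point, the sketch below asks pickle which protocols the running interpreter supports and writes the same data with two of them. The sample data and variable names are made up for the example; the choice of protocol 2 is only an assumption about what an older reader might need.

```python
import pickle

# Small stand-in for the deck (illustrative data only).
data = [('Spades', 'Ace'), ('Hearts', 2)]

# Protocol numbers known to this interpreter; the values depend on the Python version.
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

# Ask for an older protocol explicitly, e.g. so an older Python can read the result.
old_style = pickle.dumps(data, protocol=2)
newest = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# No protocol is given when loading: it is detected from the data itself.
from_old = pickle.loads(old_style)
from_new = pickle.loads(newest)
print(from_old == data, from_new == data)
```

The same protocol argument is accepted by pickle.dump() when writing to a file.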

Writing data is done on line 14, where we use pickle’s dump() function to write its first argument (the list returned by a call to deck()) to the file object given as the second argument. We then close() the file.

To verify that the dump() has worked, we reopen the file on line 17 (notice that the mode is now 'rb', indicating the file is binary and we wish to read from it), and read the cards object using pickle’s load() function. We print out cards on line 20 to verify that the load has worked. Finally, on line 21, we print out the data types of the cards object itself (a list) and of the first element in the list (which is a tuple).
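The same write-then-read round trip can also be written with with blocks, which close the file automatically even if an error occurs. This is only a sketch: it uses a short stand-in list instead of calling deck(), and it writes to a temporary directory (an assumption made so that the example leaves no file behind).

```python
import os
import pickle
import tempfile

# Short stand-in for the list returned by deck().
cards = [('Spades', 'Ace'), ('Spades', 2), ('Hearts', 'King')]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'Cards.pkl')

    # 'wb': write binary; the file is closed when the block ends.
    with open(path, 'wb') as card_file:
        pickle.dump(cards, card_file)

    # 'rb': read binary.
    with open(path, 'rb') as card_file:
        restored = pickle.load(card_file)

print(restored == cards)
print(type(restored), type(restored[0]))
```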

pickle can also save data to and read from an object in memory, rather than a file. Line 23 uses pickle’s dumps() function to write the cards list to cardPickleString, which in Python 3 is a bytes object rather than a text string. On line 24, we print it out so that you can see that it’s just binary data, not human-readable. Line 26 reads the cards list back from cardPickleString using pickle’s loads() function. You can think of the extra ‘s’ on the ends of dumps() and loads() as standing for ‘string’, a name that dates from when the result was a string.

Pickle and user-defined classes

Pickle allows you to save objects created from user-defined classes, at least in most cases. Consider this example. We define a Polygon class in Polygon.py:

from math import *

class Polygon:
    def __init__(self, numSides, sideLength) :
        self.__numSides = numSides
        self.__sideLength = sideLength
        if numSides == 3:
            self.__area = sqrt(3) * sideLength * sideLength / 4
        elif numSides == 4:
            self.__area = sideLength * sideLength
        elif numSides == 5:
            self.__area = sqrt(5 * (5 + 2*sqrt(5))) * sideLength * sideLength / 4

    def getArea(self):
        return self.__area

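This class allows a regular polygon with 3, 4 or 5 equal-length sides to be created. The constructor calculates the area of the polygon using a standard formula. For any other number of sides no area is stored, so a later call to getArea() would raise an AttributeError.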

Now we write a main program that creates a list of several polygons. We generate a second list containing the areas of the polygons and save both of these lists to a file using pickle.

import pickle
from Polygon import *
from random import *

polyList = []
for i in range(10):
    polyList += [Polygon(randint(3,5), 10 * random())]

polyArea = list(map(lambda x: x.getArea(), polyList))
print(polyList, polyArea)

polyPickle = open('PolyPickle.pkl', 'wb')
pickle.dump(polyList, polyPickle)
pickle.dump(polyArea, polyPickle)
polyPickle.close()

polyPickle = open('PolyPickle.pkl', 'rb')
polyList = pickle.load(polyPickle)
polyArea = pickle.load(polyPickle)
polyPickle.close()

print(polyList, polyArea)

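The original lists are printed out on line 10. We then save both these lists to the same file using pickle on lines 12 to 15. We can save multiple items to the same pickle file because each dump() call writes at the file’s current position, which sits just after the previous item for as long as the file stays open. (Reopening the file in 'wb' mode would erase the existing contents; mode 'ab' would be needed to add to them.)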

On lines 17 to 20 we read the same lists back from the pickle file, and print them on line 22 to verify that the process has worked. Again, we can load multiple items from the same pickle file: each pickled object marks its own end, and the file object remembers the position where the last load() stopped, so a subsequent load() begins reading from that point, provided the file is not closed and reopened in the meantime. [This does not carry over directly to data saved with dumps(): loads() is handed the whole bytes object each time and has no position to remember, so it decodes only the first pickled object and a subsequent loads() call just starts at the beginning again. Wrapping the bytes in an io.BytesIO object and calling load() on it repeatedly gets around this.]
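To make the difference between load() and loads() concrete, here is a small sketch that joins two pickles into one bytes object and reads them back both ways. The two lists are invented sample values standing in for the polygon data.

```python
import io
import pickle

# Invented sample values standing in for the two lists.
poly_sides = [3, 4, 5]
poly_areas = [43.3, 100.0, 172.0]

# Two independent pickles joined end to end, as they would sit in a file.
blob = pickle.dumps(poly_sides) + pickle.dumps(poly_areas)

# loads() decodes the first pickle only; the rest of the bytes are not touched.
first_only = pickle.loads(blob)

# BytesIO wraps the bytes in a file-like object that keeps a read position,
# so repeated load() calls step through the pickles one after another.
buffer = io.BytesIO(blob)
sides_back = pickle.load(buffer)
areas_back = pickle.load(buffer)

print(first_only, sides_back, areas_back)
```

A further load() on the exhausted buffer raises EOFError, which is one way to detect that there is nothing left to read.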

Lambda functions and dill

Now suppose we change the Polygon class above to the following:

from math import *

class Polygon:
    area = {3: lambda x: sqrt(3) * x * x / 4,
            4: lambda x: x * x,
            5: lambda x: sqrt(5 * (5 + 2*sqrt(5))) * x * x / 4
            }

    def __init__(self, numSides, sideLength):
        self.__numSides = numSides
        self.__sideLength = sideLength
        self.__area = Polygon.area[numSides](sideLength)

    def getArea(self):
        return self.__area

Instead of a multi-branched if statement to calculate the area, we define a dictionary called area which uses lambda functions to calculate the area, based on the number of sides and side length. The changes to the Polygon class should be invisible to any code using this class since the area is calculated internally and is accessible only via the getArea() method. Thus, we might hope that if we run the same main program as above, it should still work.

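However, lambda functions are one area where the usual pickle functions don’t work: any attempt to pickle a lambda raises an error saying that it cannot be pickled. Note that pickle stores a class by reference (its module and name) and writes only each instance’s own attributes, so with the class exactly as written above the lambdas in the class-level area dictionary are never written to the file. The error appears as soon as a lambda is part of the data being saved, for example if an instance keeps the function in one of its attributes, or if we try to pickle Polygon.area itself.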
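The failure is easy to reproduce in isolation. The sketch below tries to pickle a single lambda and records what happens; the names square_area and lambda_pickled are made up for the example, and the exact wording of the error message varies between Python versions.

```python
import pickle

# A lambda like the ones in the area dictionary (illustrative name).
square_area = lambda x: x * x

try:
    pickle.dumps(square_area)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as err:
    # pickle stores functions by name, and a lambda has no importable name.
    lambda_pickled = False
    print('pickle refused the lambda:', err)

# The number the lambda computes is ordinary data and pickles without trouble.
area_value = pickle.loads(pickle.dumps(square_area(3.0)))
print(lambda_pickled, area_value)
```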

A third-party package called dill, which is not part of the standard library and has to be installed separately (for example with pip install dill), does allow lambda functions, and objects that contain them, to be saved and loaded from files. Fortunately, the alterations required are fairly simple. We just need to replace ‘pickle’ by ‘dill’ in our original program, so we get:

import dill
from Polygon import *
from random import *

polyList = []
for i in range(10):
    polyList += [Polygon(randint(3,5), 10 * random())]

polyArea = list(map(lambda x: x.getArea(), polyList))
print(polyList, polyArea)

polyPickle = open('PolyPickle.pkl', 'wb')
dill.dump(polyList, polyPickle)
dill.dump(polyArea, polyPickle)
polyPickle.close()

polyPickle = open('PolyPickle.pkl', 'rb')
polyList = dill.load(polyPickle)
polyArea = dill.load(polyPickle)
polyPickle.close()

print(polyList, polyArea)

We see that we now import dill on line 1, and replace calls to pickle.dump() and pickle.load() with dill.dump() and dill.load(). This version works without errors.

Security risks

Any discussion of pickle and dill should mention that these modules pose security risks. The risk arises because loading a pickle can execute arbitrary code hidden in the data file. I don’t want to go into details (because I don’t want to be accused of revealing how to hack into someone else’s system!), but the bottom line is that you should use pickle or dill to load data only if you know and trust the source of the data. In other words, don’t just download some pickle file off the internet and read it into your own program; doing so could lay your system open to attack.
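One defensive measure described in the pickle documentation is to subclass pickle.Unpickler and override find_class(), which is the hook pickle calls whenever the data asks for a class or function by name. The sketch below refuses every such request, so only plain data (lists, tuples, strings, numbers and the like) can be loaded. The names NoGlobalsUnpickler and restricted_loads are invented for this example, and the approach narrows the risk rather than making untrusted pickles safe.

```python
import io
import os
import pickle


class NoGlobalsUnpickler(pickle.Unpickler):
    """Hypothetical unpickler that refuses to look up any class or function."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f'global {module}.{name} is forbidden')


def restricted_loads(data):
    """Like pickle.loads(), but only plain built-in data can be loaded."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()


# Plain data never triggers find_class(), so it loads as usual.
safe_blob = pickle.dumps([('Spades', 'Ace'), ('Hearts', 2)])
cards = restricted_loads(safe_blob)

# A pickle that refers to a function by name is rejected at lookup time.
suspicious_blob = pickle.dumps(os.getcwd)
try:
    restricted_loads(suspicious_blob)
    rejected = False
except pickle.UnpicklingError:
    rejected = True

print(cards, rejected)
```

A side effect of this design is that instances of user-defined classes such as Polygon are rejected as well; allowing them would mean letting specific module and class names through in find_class().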
