|
| 1 | +================================== |
| 2 | +A Guide to Masked Arrays in NumPy |
| 3 | +================================== |
| 4 | + |
| 5 | +.. Contents:: |
| 6 | + |
| 7 | +See http://www.scipy.org/scipy/numpy/wiki/MaskedArray (dead link) |
| 8 | +for updates of this document. |
| 9 | + |
| 10 | + |
| 11 | +History |
| 12 | +------- |
| 13 | + |
| 14 | +As a regular user of MaskedArray, I (Pierre G.F. Gerard-Marchant) became |
| 15 | +increasingly frustrated with the subclassing of masked arrays (even if |
| 16 | +I can only blame my inexperience). I needed to develop a class of arrays |
| 17 | +that could store some additional information along with numerical values, |
| 18 | +while keeping the possibility for missing data (picture storing a series |
| 19 | +of dates along with measurements, what would later become the `TimeSeries |
| 20 | +Scikit <http://projects.scipy.org/scipy/scikits/wiki/TimeSeries>`__ |
| 21 | +(dead link). |
| 22 | + |
| 23 | +I started to implement such a class, but then quickly realized that |
| 24 | +any additional information disappeared when processing these subarrays |
| 25 | +(for example, adding a constant value to a subarray would erase its |
| 26 | +dates). I ended up writing the equivalent of *numpy.core.ma* for my |
| 27 | +particular class, ufuncs included. Everything went fine until I needed to |
| 28 | +subclass my new class, when more problems showed up: some attributes of |
| 29 | +the new subclass were lost during processing. I identified the culprit as |
| 30 | +MaskedArray, which returns masked ndarrays when I expected masked |
| 31 | +arrays of my class. I was preparing myself to rewrite *numpy.core.ma* |
| 32 | +when I forced myself to learn how to subclass ndarrays. As I became more |
| 33 | +familiar with the *__new__* and *__array_finalize__* methods, |
| 34 | +I started to wonder why masked arrays were objects, and not ndarrays, |
| 35 | +and whether it wouldn't be more convenient for subclassing if they did |
| 36 | +behave like regular ndarrays. |
| 37 | + |
| 38 | +The new *maskedarray* is what I eventually come up with. The |
| 39 | +main differences with the initial *numpy.core.ma* package are |
| 40 | +that MaskedArray is now a subclass of *ndarray* and that the |
| 41 | +*_data* section can now be any subclass of *ndarray*. Apart from a |
| 42 | +couple of issues listed below, the behavior of the new MaskedArray |
| 43 | +class reproduces the old one. Initially the *maskedarray* |
| 44 | +implementation was marginally slower than *numpy.ma* in some areas, |
| 45 | +but work is underway to speed it up; the expectation is that it can be |
| 46 | +made substantially faster than the present *numpy.ma*. |
| 47 | + |
| 48 | + |
| 49 | +Note that if the subclass has some special methods and |
| 50 | +attributes, they are not propagated to the masked version: |
| 51 | +this would require a modification of the *__getattribute__* |
| 52 | +method (first trying *ndarray.__getattribute__*, then trying |
| 53 | +*self._data.__getattribute__* if an exception is raised in the first |
| 54 | +place), which really slows things down. |
| 55 | + |
| 56 | +Main differences |
| 57 | +---------------- |
| 58 | + |
| 59 | + * The *_data* part of the masked array can be any subclass of ndarray (but not recarray, cf below). |
| 60 | + * *fill_value* is now a property, not a function. |
| 61 | + * in the majority of cases, the mask is forced to *nomask* when no value is actually masked. A notable exception is when a masked array (with no masked values) has just been unpickled. |
| 62 | + * I got rid of the *share_mask* flag, I never understood its purpose. |
| 63 | + * *put*, *putmask* and *take* now mimic the ndarray methods, to avoid unpleasant surprises. Moreover, *put* and *putmask* both update the mask when needed. * if *a* is a masked array, *bool(a)* raises a *ValueError*, as it does with ndarrays. |
| 64 | + * in the same way, the comparison of two masked arrays is a masked array, not a boolean |
| 65 | + * *filled(a)* returns an array of the same subclass as *a._data*, and no test is performed on whether it is contiguous or not. |
| 66 | + * the mask is always printed, even if it's *nomask*, which makes things easy (for me at least) to remember that a masked array is used. |
| 67 | + * *cumsum* works as if the *_data* array was filled with 0. The mask is preserved, but not updated. |
| 68 | + * *cumprod* works as if the *_data* array was filled with 1. The mask is preserved, but not updated. |
| 69 | + |
| 70 | +New features |
| 71 | +------------ |
| 72 | + |
| 73 | +This list is non-exhaustive... |
| 74 | + |
| 75 | + * the *mr_* function mimics *r_* for masked arrays. |
| 76 | + * the *anom* method returns the anomalies (deviations from the average) |
| 77 | + |
| 78 | +Using the new package with numpy.core.ma |
| 79 | +---------------------------------------- |
| 80 | + |
| 81 | +I tried to make sure that the new package can understand old masked |
| 82 | +arrays. Unfortunately, there's no upward compatibility. |
| 83 | + |
| 84 | +For example: |
| 85 | + |
| 86 | +>>> import numpy.core.ma as old_ma |
| 87 | +>>> import maskedarray as new_ma |
| 88 | +>>> x = old_ma.array([1,2,3,4,5], mask=[0,0,1,0,0]) |
| 89 | +>>> x |
| 90 | +array(data = |
| 91 | + [ 1 2 999999 4 5], |
| 92 | + mask = |
| 93 | + [False False True False False], |
| 94 | + fill_value=999999) |
| 95 | +>>> y = new_ma.array([1,2,3,4,5], mask=[0,0,1,0,0]) |
| 96 | +>>> y |
| 97 | +array(data = [1 2 -- 4 5], |
| 98 | + mask = [False False True False False], |
| 99 | + fill_value=999999) |
| 100 | +>>> x==y |
| 101 | +array(data = |
| 102 | + [True True True True True], |
| 103 | + mask = |
| 104 | + [False False True False False], |
| 105 | + fill_value=?) |
| 106 | +>>> old_ma.getmask(x) == new_ma.getmask(x) |
| 107 | +array([True, True, True, True, True]) |
| 108 | +>>> old_ma.getmask(y) == new_ma.getmask(y) |
| 109 | +array([True, True, False, True, True]) |
| 110 | +>>> old_ma.getmask(y) |
| 111 | +False |
| 112 | + |
| 113 | + |
| 114 | +Using maskedarray with matplotlib |
| 115 | +--------------------------------- |
| 116 | + |
| 117 | +Starting with matplotlib 0.91.2, the masked array importing will work with |
| 118 | +the maskedarray branch) as well as with earlier versions. |
| 119 | + |
| 120 | +By default matplotlib still uses numpy.ma, but there is an rcParams setting |
| 121 | +that you can use to select maskedarray instead. In the matplotlibrc file |
| 122 | +you will find:: |
| 123 | + |
| 124 | + #maskedarray : False # True to use external maskedarray module |
| 125 | + # instead of numpy.ma; this is a temporary # |
| 126 | + setting for testing maskedarray. |
| 127 | + |
| 128 | + |
| 129 | +Uncomment and set to True to select maskedarray everywhere. |
| 130 | +Alternatively, you can test a script with maskedarray by using a |
| 131 | +command-line option, e.g.:: |
| 132 | + |
| 133 | + python simple_plot.py --maskedarray |
| 134 | + |
| 135 | + |
| 136 | +Masked records |
| 137 | +-------------- |
| 138 | + |
| 139 | +Like *numpy.core.ma*, the *ndarray*-based implementation |
| 140 | +of MaskedArray is limited when working with records: you can |
| 141 | +mask any record of the array, but not a field in a record. If you |
| 142 | +need this feature, you may want to give the *mrecords* package |
| 143 | +a try (available in the *maskedarray* directory in the scipy |
| 144 | +sandbox). This module defines a new class, *MaskedRecord*. An |
| 145 | +instance of this class accepts a *recarray* as data, and uses two |
| 146 | +masks: the *fieldmask* has as many entries as records in the array, |
| 147 | +each entry with the same fields as a record, but of boolean types: |
| 148 | +they indicate whether the field is masked or not; a record entry |
| 149 | +is flagged as masked in the *mask* array if all the fields are |
| 150 | +masked. A few examples in the file should give you an idea of what |
| 151 | +can be done. Note that *mrecords* is still experimental... |
| 152 | + |
| 153 | +Optimizing maskedarray |
| 154 | +---------------------- |
| 155 | + |
| 156 | +Should masked arrays be filled before processing or not? |
| 157 | +-------------------------------------------------------- |
| 158 | + |
| 159 | +In the current implementation, most operations on masked arrays involve |
| 160 | +the following steps: |
| 161 | + |
| 162 | + * the input arrays are filled |
| 163 | + * the operation is performed on the filled arrays |
| 164 | + * the mask is set for the results, from the combination of the input masks and the mask corresponding to the domain of the operation. |
| 165 | + |
| 166 | +For example, consider the division of two masked arrays:: |
| 167 | + |
| 168 | + import numpy |
| 169 | + import maskedarray as ma |
| 170 | + x = ma.array([1,2,3,4],mask=[1,0,0,0], dtype=numpy.float_) |
| 171 | + y = ma.array([-1,0,1,2], mask=[0,0,0,1], dtype=numpy.float_) |
| 172 | + |
| 173 | +The division of x by y is then computed as:: |
| 174 | + |
| 175 | + d1 = x.filled(0) # d1 = array([0., 2., 3., 4.]) |
| 176 | + d2 = y.filled(1) # array([-1., 0., 1., 1.]) |
| 177 | + m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m = |
| 178 | + array([True,False,False,True]) |
| 179 | + dm = ma.divide.domain(d1,d2) # array([False, True, False, False]) |
| 180 | + result = (d1/d2).view(MaskedArray) # masked_array([-0. inf, 3., 4.]) |
| 181 | + result._mask = logical_or(m, dm) |
| 182 | + |
| 183 | +Note that a division by zero takes place. To avoid it, we can consider |
| 184 | +to fill the input arrays, taking the domain mask into account, so that:: |
| 185 | + |
| 186 | + d1 = x._data.copy() # d1 = array([1., 2., 3., 4.]) |
| 187 | + d2 = y._data.copy() # array([-1., 0., 1., 2.]) |
| 188 | + dm = ma.divide.domain(d1,d2) # array([False, True, False, False]) |
| 189 | + numpy.putmask(d2, dm, 1) # d2 = array([-1., 1., 1., 2.]) |
| 190 | + m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m = |
| 191 | + array([True,False,False,True]) |
| 192 | + result = (d1/d2).view(MaskedArray) # masked_array([-1. 0., 3., 2.]) |
| 193 | + result._mask = logical_or(m, dm) |
| 194 | + |
| 195 | +Note that the *.copy()* is required to avoid updating the inputs with |
| 196 | +*putmask*. The *.filled()* method also involves a *.copy()*. |
| 197 | + |
| 198 | +A third possibility consists in avoid filling the arrays:: |
| 199 | + |
| 200 | + d1 = x._data # d1 = array([1., 2., 3., 4.]) |
| 201 | + d2 = y._data # array([-1., 0., 1., 2.]) |
| 202 | + dm = ma.divide.domain(d1,d2) # array([False, True, False, False]) |
| 203 | + m = ma.mask_or(ma.getmask(x), ma.getmask(y)) # m = |
| 204 | + array([True,False,False,True]) |
| 205 | + result = (d1/d2).view(MaskedArray) # masked_array([-1. inf, 3., 2.]) |
| 206 | + result._mask = logical_or(m, dm) |
| 207 | + |
| 208 | +Note that here again the division by zero takes place. |
| 209 | + |
| 210 | +A quick benchmark gives the following results: |
| 211 | + |
| 212 | + * *numpy.ma.divide* : 2.69 ms per loop |
| 213 | + * classical division : 2.21 ms per loop |
| 214 | + * division w/ prefilling : 2.34 ms per loop |
| 215 | + * division w/o filling : 1.55 ms per loop |
| 216 | + |
| 217 | +So, is it worth filling the arrays beforehand ? Yes, if we are interested |
| 218 | +in avoiding floating-point exceptions that may fill the result with infs |
| 219 | +and nans. No, if we are only interested into speed... |
| 220 | + |
| 221 | + |
| 222 | +Thanks |
| 223 | +------ |
| 224 | + |
| 225 | +I'd like to thank Paul Dubois, Travis Oliphant and Sasha for the |
| 226 | +original masked array package: without you, I would never have started |
| 227 | +that (it might be argued that I shouldn't have anyway, but that's |
| 228 | +another story...). I also wish to extend these thanks to Reggie Dugard |
| 229 | +and Eric Firing for their suggestions and numerous improvements. |
| 230 | + |
| 231 | + |
| 232 | +Revision notes |
| 233 | +-------------- |
| 234 | + |
| 235 | + * 08/25/2007 : Creation of this page |
| 236 | + * 01/23/2007 : The package has been moved to the SciPy sandbox, and is regularly updated: please check out your SVN version! |
0 commit comments