python - How should I go about subsampling from a scipy.sparse.csr.csr_matrix and a list
I have a scipy.sparse.csr.csr_matrix
that represents the words in a document, and a list of lists in which each index holds the categories for the matching row of the matrix.
The problem I am having is that I need to randomly select N rows of the data.
So if my matrix looks like this
[1:3 2:3 4:4] [1:5 2:5 5:4]
and my list of lists looked like this
((20,40) (80,50))
and I needed to sample 1 value, I could end up with this
[1:3 2:3 4:4] ((20,40))
I have searched the scipy documentation, but I cannot find a way to generate a new csr matrix using a list of indexes.
You can index a csr matrix using a list of indices. First create a matrix, and look at it:
>>> from scipy.sparse import csr_matrix
>>> m = csr_matrix([[0,0,1,0], [4,3,0,0], [3,0,0,8]])
>>> m
<3x4 sparse matrix of type '<type 'numpy.int64'>'
    with 5 stored elements in Compressed Sparse Row format>
>>> print(m.toarray())
[[0 0 1 0]
 [4 3 0 0]
 [3 0 0 8]]
Of course, you can get the first row:
>>> m[0]
<1x4 sparse matrix of type '<type 'numpy.int64'>'
    with 1 stored elements in Compressed Sparse Row format>
>>> print(m[0].toarray())
[[0 0 1 0]]
But you can also look at the first and third rows at once by using the list [0,2] as the index:
>>> m[[0,2]]
<2x4 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> print(m[[0,2]].toarray())
[[0 0 1 0]
 [3 0 0 8]]
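As a script rather than a REPL session, the same list-based row selection might look like the sketch below. It assumes a recent NumPy and SciPy on Python 3; the variable names rows and sub are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Same 3x4 example matrix as in the session above.
m = csr_matrix(np.array([[0, 0, 1, 0],
                         [4, 3, 0, 0],
                         [3, 0, 0, 8]]))

rows = [0, 2]   # first and third rows
sub = m[rows]   # the result is still a sparse matrix, not a dense array

print(issparse(sub), sub.shape)
print(sub.toarray())
```

The result keeps the rows in the order given by the index list, so the same list can later be used to pick the matching categories.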
Now you can generate n random indices with no repeats (no replacement) using numpy's choice:
import numpy as np
i = np.random.choice(np.arange(m.shape[0]), n, replace=False)
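A variant of this step, sketched with NumPy's newer Generator interface and a fixed seed so the draw is reproducible. The seed value and the name n_rows (standing in for m.shape[0]) are assumptions for illustration. The sketch also shows that asking for more rows than exist fails when sampling without replacement.

```python
import numpy as np

n_rows = 3   # stands in for m.shape[0]
n = 2        # number of rows to sample

rng = np.random.default_rng(0)                  # seeded for reproducibility
i = rng.choice(n_rows, size=n, replace=False)   # n distinct row indices

# Requesting more distinct indices than there are rows raises ValueError.
try:
    rng.choice(n_rows, size=n_rows + 1, replace=False)
    oversample_failed = False
except ValueError:
    oversample_failed = True

print(i, oversample_failed)
```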
Then you can use those indices to grab the rows from your original matrix m:
sub_m = m[i]
To grab them from the categories list of lists, you must first make it an array; then you can index it with the list i:
sub_c = np.asarray(categories)[i]
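If the inner lists do not all have the same length, converting them to an array may not give a regular 2-D array. In that case a plain list comprehension is one possible alternative. The sketch below uses made-up ragged data and a hand-picked index array, both hypothetical.

```python
import numpy as np

# Hypothetical categories with different lengths per row.
categories = [(20, 40), (80, 50, 10), (7,)]

# Stand-in for the sampled indices from np.random.choice.
i = np.array([2, 0])

# Python lists accept NumPy integers as indices, so no conversion is needed.
sub_c = [categories[k] for k in i]

print(sub_c)
```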
If you want to have a list of lists back, use:
sub_c.tolist()
Or, if you have/want a tuple of tuples, I think you have to do it manually:
tuple(map(tuple, sub_c))
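Putting the steps together, a minimal sketch of a helper that samples rows and their categories with the same indices could look like this. The function name subsample, its seed parameter, and the third category tuple are assumptions made for the example, not part of scipy or of the question.

```python
import numpy as np
from scipy.sparse import csr_matrix

def subsample(matrix, categories, n, seed=None):
    """Pick n rows without replacement, plus the matching category entries."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(matrix.shape[0], size=n, replace=False)
    sub_m = matrix[idx]                       # sparse row selection
    sub_c = [categories[k] for k in idx]      # same indices, same order
    return sub_m, sub_c, idx

m = csr_matrix(np.array([[0, 0, 1, 0],
                         [4, 3, 0, 0],
                         [3, 0, 0, 8]]))
categories = [(20, 40), (80, 50), (10, 30)]   # one entry per matrix row

sub_m, sub_c, idx = subsample(m, categories, n=2, seed=0)
print(idx)
print(sub_m.toarray())
print(sub_c)
```

Returning idx alongside the samples makes it possible to trace each sampled row back to its position in the original data.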