parsing - Split text in list of lists depending on delimiter, in one pass

- July 15, 2011

I have a large ASCII-delimited text file that looks like this:

10000\x1f4959\x1f\4567\x1f\x1f\x1e20000\x1f456\x1f456\x1f\x1f\x1e...

The desired result is a list of lists like:

[[10000,4959,4567],[20000,456,456],...]

I can do it in two passes: first using text.split('\x1e'), then using a loop to split each sublist on '\x1f'.

But is there a way to achieve the same result in one pass?
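For reference, the two-pass approach described above could be sketched roughly as follows. The names RECORD_SEP, UNIT_SEP and split_two_pass are made up for illustration, the sample is typed without the backslash before 4567, and the fields are left as strings rather than converted to integers.

```python
# Two-pass baseline: split on the outer delimiter, then split each piece.
RECORD_SEP = '\x1e'  # outer delimiter from the question
UNIT_SEP = '\x1f'    # inner delimiter from the question


def split_two_pass(text):
    """Return a list of lists, dropping empty fields and empty records."""
    result = []
    for chunk in text.split(RECORD_SEP):  # pass 1: outer delimiter
        # pass 2: inner delimiter, keeping only non-empty fields
        fields = [field for field in chunk.split(UNIT_SEP) if field]
        if fields:
            result.append(fields)
    return result


sample = '10000\x1f4959\x1f4567\x1f\x1f\x1e20000\x1f456\x1f456\x1f\x1f\x1e'
print(split_two_pass(sample))
# [['10000', '4959', '4567'], ['20000', '456', '456']]
```

The first split builds a full intermediate list of chunks before any inner splitting happens, which is the extra work a one-pass solution tries to avoid.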

Assumptions

"...[I]s there a way to achieve the same result in one pass?"
I assume you mean iterating over the contents of the "large ascii delimited text file" only once.
I further assume the posted snippet is a representative sample of the data.

Overview

You have three things here in a nested structure:

the strings (the records)
the sublists (lists of strings delimited by the record separator)
the outer list (the list of sublists given the large file)

Based on the sample data, both the strings and the sublists are short, so you should be able to do relatively expensive operations on them without much performance loss. It's the outer list you want to optimize for, and you want to make sure you pass over the huge ASCII text file only once.
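If the file is too large to load comfortably, one way to keep the "only once" property is to hand characters to the parsing loop lazily instead of reading everything up front. The sketch below is only an illustration under assumptions: iter_characters and chunk_size are hypothetical names, and io.StringIO stands in for a real file object opened in text mode.

```python
import io


def iter_characters(handle, chunk_size=64 * 1024):
    """Yield characters one at a time, reading the file in bounded chunks."""
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:  # an empty string means end of file
            break
        for character in chunk:
            yield character


# io.StringIO is a stand-in here for a real file object
fake_file = io.StringIO('10000\x1f4959\x1e20000\x1f456\x1e')
for character in iter_characters(fake_file, chunk_size=4):
    pass  # a character-by-character parser would consume these here
```

Because the generator yields plain characters, any loop written as "for character in text" can consume it unchanged, and at most one chunk is held in memory at a time.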

The algorithm

Understand that there is more than one way to do this, and I'm not claiming this is the most powerful, efficient or expressive way. YMMV depending on how well my assumptions match your reality. This generates (read up on generators in this excellent question) the sublists, which are themselves plain lists rather than generators, since they are short. Processing the generator will yield the outer list, while only processing the text once.

>>> def generate_sublists(text, outer_delim='\x1e', inner_delim='\x1f'):
...   sublist = []
...   list_item = ''
...   for character in text:  # comb over the text once (one loop)
...     if character == inner_delim:
...       # when we hit an inner delimiter,
...       # push the string onto the list
...       sublist.append(list_item)
...       # and reset the string placeholder to empty
...       list_item = ''
...     elif character == outer_delim:
...       # when we hit an outer delimiter, generate the sublist
...       yield sublist
...       # and reset the sublist placeholder to empty
...       sublist = []
...     else:
...       # any other character we add onto the string placeholder
...       list_item += character
...
>>> # sample data provided
>>> text = '10000\x1f4959\x1f\4567\x1f\x1f\x1e20000\x1f456\x1f456\x1f\x1f\x1e'
>>> outer_list = []
>>> for sublist in generate_sublists(text):
...   outer_list.append(sublist)
...
>>> outer_list
[['10000', '4959', '.7', ''], ['20000', '456', '456', '']]

Wait, why doesn't that match the expected output?

There are some oddities in the sample data you posted. For example, the inner delimiter appears twice in a row ('\x1f\x1f'). The algorithm treats that as an empty record, while your expected output leaves it out. One possible fix is to filter the output. In other words:

>>> outer_list = []
>>> for sublist in generate_sublists(text):
...   outer_list.append(list(filter(bool, sublist)))
...
>>> outer_list
[['10000', '4959', '.7'], ['20000', '456', '456']]

Again, since the sublists are short, this shouldn't add much processing time. If it does, add a check before sublist.append(list_item):

...   for character in text:
...     if character == inner_delim:
...       if not list_item:
...         continue
...       sublist.append(list_item)
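Putting the pieces together, a self-contained variant might look like the sketch below. It is not the code above verbatim: the name generate_nonempty_sublists is hypothetical, the field is collected in a list and joined (instead of repeated string concatenation), and a pending field is also flushed when the outer delimiter is hit, on the assumption that some records may lack a trailing inner delimiter. The sample is typed without the backslash before 4567.

```python
def generate_nonempty_sublists(text, outer_delim='\x1e', inner_delim='\x1f'):
    """Yield one list of non-empty fields per record, scanning text once."""
    sublist = []
    pieces = []  # characters of the field currently being built
    for character in text:
        if character == inner_delim or character == outer_delim:
            if pieces:  # skip empty fields such as the one in '\x1f\x1f'
                sublist.append(''.join(pieces))
                pieces = []
            if character == outer_delim:
                yield sublist
                sublist = []
        else:
            pieces.append(character)


sample = '10000\x1f4959\x1f4567\x1f\x1f\x1e20000\x1f456\x1f456\x1f\x1f\x1e'
print(list(generate_nonempty_sublists(sample)))
# [['10000', '4959', '4567'], ['20000', '456', '456']]
```

Like the original, this sketch only yields a sublist when it sees an outer delimiter, so data after the last '\x1e' would be dropped.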

You may have noticed that the output data has .7 where you expected 4567. That's because your example input contains 4959\x1f\4567 (note the backslash before 4567 -- maybe a typo?). The backslash causes \456 to be interpreted as an octal escape. Using Python we can decipher this:

>>> 0o456
302
>>> 302 % 256
46

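\456 in octal is the same as 302 in decimal, but a single byte only holds values in the range 0-255, so we have to apply modulo 256 to see what the value becomes: 46, which is the ASCII code for '.'.

I assume it's a typo, though, since you don't expect it in your output.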
