python - How to construct regex for this text


Here's the input:

7. data 1 1. str1 str2 3. 12345 4. 0876 9. no 2 1. str 2. strt str 3. 9909090 5. yes 6. no 7. yes 8. no 9. yes 10. 5000 xx 11. 1000 zŁ 12. no prub. 1 1. 1000 xx 2. no 3 1. strt 2. strt 3. 63110300291 5. yes 6. no 7. no 8. no 9. yes 10. 5000 xx 11. 1000 zŁ 12. no prub. 1 1. 1000 xx 2. no 4 1. qweret 2. iostr9 3. 76012509879 5. yes 6. no 7. no 8. no 9. yes 10. 5000 xx 11. 1000 xx 12. no prub. 1 1. 1000 xx 2. no 0 1. 

And here's the expected output:

[('1', '1. str1 str2 3. 12345 4. 0876 9. no'), ('2', '1. str 2. strt str 3. 9909090 5. yes 6. no 7. yes 8. no 9. yes 10. 5000 xx 11. 1000 zŁ 12. no prub. 1 1. 1000 xx 2. no'), ('3', '1. strt 2. strt 3. 63110300291 5. yes 6. no 7. no 8. no 9. yes 10. 5000 xx 11. 1000 zŁ 12. no prub. 1 1. 1000 xx 2. no'), ('4', '1. qweret 2. iostr9 3. 76012509879 5. yes 6. no 7. no 8. no 9. yes 10. 5000 xx 11. 1000 xx 12. no prub. 1 1. 1000 xx 2. no')] 

I've tried this:

re.findall(r'(?=\s(\d+)\s(1\..*?)\s\d+\s1\.)', txt, re.DOTALL)

But of course it's not the right solution: the regex has to match `(\d+) 1.` at the start of a record, but not the `1 1.` that follows `prub.` inside a record.
What should I change to make it work?
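To make the failure concrete, here is a small sketch that runs the attempted pattern on a shortened sample. The sample string is an abbreviation made up for illustration (it keeps the same layout as the input above, including a `prub. 1 1.` sub-record), and the variable names are hypothetical.

```python
import re

# Shortened, made-up sample with the same layout as the real input:
# records 1, 2 and 3, where record 2 contains a nested "prub. 1 1." part.
sample = ('7. data 1 1. str1 9. no 2 1. str 12. no prub. 1 1. 1000 xx 2. no '
          '3 1. strt 2. no 0 1.')

# The pattern from the question, unchanged.
attempt = r'(?=\s(\d+)\s(1\..*?)\s\d+\s1\.)'

broken = re.findall(attempt, sample, re.DOTALL)
for number, body in broken:
    print(number, '|', body)
```

On this sample, record 2 is cut off right after `prub.`, and the nested `1 1. 1000 xx 2. no` shows up as a separate record of its own, which is exactly the unwanted split.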

How about this:

    In [1]: s='7. data 1 1. str1 str2 3. 12345 4. 0876 9. no 2 1. str 2. strt str 3. 9909090 5. yes 6. no 7. yes 8. no 9. yes 10. 5000 xx 11. 1000 zŁ 12. no prub. 1 1. 1000 xx 2. no 3 1. strt 2. strt 3. 63110300291 5. yes 6. no 7. no 8. no 9. yes 10. 5000 xx 11. 1000 zŁ 12. no prub. 1 1. 1000 xx 2. no 4 1. qweret 2. iostr9 3. 76012509879 5. yes 6. no 7. no 8. no 9. yes 10. 5000 xx 11. 1000 xx 12. no prub. 1 1. 1000 xx 2. no 0 1.'

    In [2]: import re

    In [3]: re.findall(r'(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[a-z]))',s)
    Out[3]:
    ['1 1. str1 str2 3. 12345 4. 0876 9. no',
     '2 1. str 2. strt str 3. 9909090 5. yes 6. no 7. yes 8. no 9. yes 10. 5000 xx 11. 1000 z\xc5\x81 12. no prub. 1 1. 1000 xx 2. no',
     '3 1. strt 2. strt 3. 63110300291 5. yes 6. no 7. no 8. no 9. yes 10. 5000 xx 11. 1000 z\xc5\x81 12. no prub. 1 1. 1000 xx 2. no',
     '4 1. qweret 2. iostr9 3. 76012509879 5. yes 6. no 7. no 8. no 9. yes 10. 5000 xx 11. 1000 xx 12. no prub. 1 1. 1000 xx 2. no']

For the exact output you'd like:

    In [4]: ns = re.findall(r'(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[a-z]))',s)

    In [5]: [tuple(f.split(' ',1)) for f in ns]
    Out[5]:
    [('1', '1. str1 str2 3. 12345 4. 0876 9. no'),
     ('2', '1. str 2. strt str 3. 9909090 5. yes 6. no 7. yes 8. no 9. yes 10. 5000 xx 11. 1000 z\xc5\x81 12. no prub. 1 1. 1000 xx 2. no'),
     ('3', '1. strt 2. strt 3. 63110300291 5. yes 6. no 7. no 8. no 9. yes 10. 5000 xx 11. 1000 z\xc5\x81 12. no prub. 1 1. 1000 xx 2. no'),
     ('4', '1. qweret 2. iostr9 3. 76012509879 5. yes 6. no 7. no 8. no 9. yes 10. 5000 xx 11. 1000 xx 12. no prub. 1 1. 1000 xx 2. no')]
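As a possible variation (a sketch, not the pattern from the answer above), capture groups let `re.findall` return the tuples directly, so the `split` step is not needed. Two things here are assumptions of mine: `\d+` is used so that record numbers with more than one digit would also work, and the header is required to be a literal `1.` because every record in the sample starts with field 1. The names `record_re` and `records` are hypothetical.

```python
import re

# The input from the question, split across lines for readability.
s = ('7. data 1 1. str1 str2 3. 12345 4. 0876 9. no '
     '2 1. str 2. strt str 3. 9909090 5. yes 6. no 7. yes 8. no 9. yes '
     '10. 5000 xx 11. 1000 zŁ 12. no prub. 1 1. 1000 xx 2. no '
     '3 1. strt 2. strt 3. 63110300291 5. yes 6. no 7. no 8. no 9. yes '
     '10. 5000 xx 11. 1000 zŁ 12. no prub. 1 1. 1000 xx 2. no '
     '4 1. qweret 2. iostr9 3. 76012509879 5. yes 6. no 7. no 8. no 9. yes '
     '10. 5000 xx 11. 1000 xx 12. no prub. 1 1. 1000 xx 2. no '
     '0 1.')

# Group 1: record number. Group 2: record body, starting at field "1.".
# The look-ahead stops the lazy body where the next record header begins:
# "<number> 1." followed by end of string or by a space and a lowercase letter.
record_re = re.compile(r'(?<=\s)(\d+)\s(1\..*?)(?=\s\d+\s1\.(?:$|\s[a-z]))')

records = record_re.findall(s)
for number, body in records:
    print(number, '|', body)
```

The nested `prub. 1 1. 1000 xx` part is not treated as a record boundary because its `1.` is followed by a digit rather than a letter. That relies on the first field of every real record starting with a lowercase letter, as it does in the sample.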

There might be a better way, but my Python-fu isn't as good as my regexp-fu.

Regexplanation:

    (?<=\s) # positive look-behind: match the leading space but don't include it
    \d      # match a digit
    .*?     # match until the next record (lazy)
            # the following positive look-aheads are the key: they match the start of
            # each new record, i.e.
            # 2 1. s
            # 3 1. s
            # 4 1. q
            # 0 1.$
            # look-arounds match but don't consume, so the scan doesn't move past them.

    (?=\s\d\s\d[.](?=$|\s[a-z]))

    (?=     # positive look-ahead 1
    \s      # space
    \d      # digit
    \s      # space
    \d      # digit
    [.]     # period
    (?=     # positive look-ahead 2
    $       # end of string
    |       # or
    \s[a-z] # space followed by a lowercase letter
    )       # close look-ahead 2
    )       # close look-ahead 1
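The same breakdown can live inside the code by compiling the pattern with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. This is only a sketch: the pattern is the one explained above, while the name `RECORD` and the shortened sample string are made up for illustration.

```python
import re

# The answer's pattern, spelled out with re.VERBOSE so the comments travel with it.
RECORD = re.compile(r"""
    (?<=\s)            # preceded by whitespace (not part of the match)
    \d                 # record number: a single digit
    .*?                # record body, as short as possible
    (?=                # stop where the next record header begins:
        \s\d\s\d[.]    #   space, digit, space, digit, period
        (?=$|\s[a-z])  #   then end of string, or a space and a lowercase letter
    )
""", re.VERBOSE)

# Shortened, made-up sample with the same layout as the real input.
sample = '7. data 1 1. a 9. no 2 1. b prub. 1 1. 1000 xx 2. no 0 1.'

matches = RECORD.findall(sample)
pairs = [tuple(m.split(' ', 1)) for m in matches]
print(pairs)
```

Note that `\d` matches a single digit, so this form assumes record numbers stay below 10.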
