Implementing Google's DiffMatchPatch API for Python 2/3
I want to write a simple diff application in Python using Google's diff-match-patch APIs. I'm quite new to Python, and I would like an example of how to use the diff-match-patch API for semantically comparing two paragraphs of text. I'm not sure how to go about using the diff_match_patch.py file and importing it. Any help is appreciated!

Additionally, I've tried using difflib, but found it ineffective for comparing largely varied sentences. I'm using Ubuntu 12.04 x64.
Google's diff-match-patch API is the same for all the languages it is implemented in (Java, JavaScript, Dart, C++, C#, Objective-C, Lua, and Python 2.x or Python 3.x). Therefore one can typically use sample snippets in languages other than one's target language to figure out the particular API calls needed for various diff/match/patch tasks.

In the case of a simple "semantic" comparison, this is all you need:
    import diff_match_patch

    textA = "the cat in the red hat"
    textB = "the feline in the blue hat"

    # Create a diff_match_patch object
    dmp = diff_match_patch.diff_match_patch()

    # Depending on the kind of text you work with, in terms of overall length
    # and complexity, you may want to extend (or here suppress) the
    # time-out feature
    dmp.Diff_Timeout = 0   # or some other value; the default is 1.0 seconds

    # All 'diff' jobs start by invoking diff_main()
    diffs = dmp.diff_main(textA, textB)

    # diff_cleanupSemantic() is used to make the diffs array more "human" readable
    dmp.diff_cleanupSemantic(diffs)

    # And if you want the results as a ready-to-display HTML snippet
    htmlSnippet = dmp.diff_prettyHtml(diffs)
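Regarding the import itself: diff_match_patch.py just needs to be somewhere Python can find it. A minimal sketch, assuming you either drop the file next to your own script or keep it in some library directory (the path below is hypothetical); the diff-match-patch package on PyPI provides the same module if you prefer installing it:

    # Minimal sketch of making diff_match_patch importable.
    # Option 1: place diff_match_patch.py in the same directory as this script;
    #           that directory is already on sys.path.
    # Option 2: append the directory holding the file to sys.path first.
    import sys
    sys.path.append("/path/to/your/libs")  # hypothetical location of diff_match_patch.py

    import diff_match_patch

    dmp = diff_match_patch.diff_match_patch()
    print(dmp.diff_main("hello world", "hello there"))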
A word on "semantic" processing in diff-match-patch
Beware that such processing is mostly useful for presenting differences to a human viewer, because it tends to produce a shorter list of differences by avoiding non-relevant resynchronization of the texts (when, for example, two distinct words happen to have common letters in their middle). The results produced are however far from perfect, as this processing relies on simple heuristics based on the length of the differences, surface patterns, etc., rather than on actual NLP processing based on lexicons and other semantic-level devices.
For example, the textA and textB values used above produce the following "before-and-after-diff_cleanupSemantic" values for the diffs array:

    [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
    [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'red'), (1, 'blue'), (0, ' hat')]
Nice! The letter 'e' common to red and blue causes diff_main() to see this area of the text as four edits, but diff_cleanupSemantic() reduces this to just two edits, nicely singling out the different "sems" 'blue' and 'red'.
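For the record, here is a minimal sketch of how to capture both the "before" and "after" arrays shown above; note that diff_cleanupSemantic() modifies the list in place, so we keep a copy of the raw result first:

    import diff_match_patch

    dmp = diff_match_patch.diff_match_patch()
    dmp.Diff_Timeout = 0

    diffs = dmp.diff_main("the cat in the red hat", "the feline in the blue hat")
    before = list(diffs)             # shallow copy is enough: the tuples are immutable

    dmp.diff_cleanupSemantic(diffs)  # merges the cosmetic micro-edits in place

    print(before)  # the raw, character-level edit list
    print(diffs)   # the semantically cleaned-up edit list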
However, if we have, for example:

    textA = "stackoverflow is cool"
    textB = "so is very cool"

the before/after arrays produced are:
    [(0, 's'), (-1, 'tack'), (0, 'o'), (-1, 'verflow'), (0, ' is'), (1, ' very'), (0, ' cool')]
    [(0, 's'), (-1, 'tackoverflow is'), (1, 'o is very'), (0, ' cool')]
which shows that the allegedly semantically improved "after" can be rather unduly "tortured" compared with the "before". Note, for example, how the leading 's' is kept as a match and how the added 'very' word is mixed with parts of the 'is cool' expression. Ideally, we'd expect something like:
    [(-1, 'stackoverflow'), (1, 'so'), (0, ' is'), (1, ' very'), (0, ' cool')]
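If you actually want a word-level diff closer to that ideal, the diff-match-patch wiki describes a line-mode trick that adapts naturally to words: encode each word as a single placeholder character, diff the encoded strings, then decode. A rough sketch (word_diff() is a hypothetical helper name, and the approach assumes words are separated by single spaces):

    import diff_match_patch

    def word_diff(text1, text2):
        # Approximate a word-level diff by reusing the library's line-mode
        # machinery with one word per "line".
        dmp = diff_match_patch.diff_match_patch()
        a = text1.replace(" ", "\n")
        b = text2.replace(" ", "\n")
        # Each distinct word is encoded as a single placeholder character
        (chars1, chars2, word_array) = dmp.diff_linesToChars(a, b)
        diffs = dmp.diff_main(chars1, chars2, False)   # plain character diff
        dmp.diff_charsToLines(diffs, word_array)       # decode back to words, in place
        return [(op, text.replace("\n", " ")) for (op, text) in diffs]

    print(word_diff("stackoverflow is cool", "so is very cool"))
    # roughly: [(-1, 'stackoverflow '), (1, 'so '), (0, 'is '), (1, 'very '), (0, 'cool')]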