Implementing Google's DiffMatchPatch API for Python 2/3
I want to write a simple diff application in Python using Google's diff-match-patch APIs. I'm quite new to Python, and I would like an example of how to use the diff-match-patch API for semantically comparing two paragraphs of text. I'm not sure how to go about using the diff_match_patch.py file and importing it. Any help is appreciated!

Additionally, I've tried using difflib, but found it ineffective for comparing largely varied sentences. I'm using Ubuntu 12.04 x64.
Google's diff-match-patch API is the same for all the languages it is implemented in (Java, JavaScript, Dart, C++, C#, Objective-C, Lua, and Python 2.x or Python 3.x). Therefore one can typically use sample snippets in languages other than one's target language to figure out the particular API calls needed for various diff/match/patch tasks.

In the case of a simple "semantic" comparison, this is all you need:
    import diff_match_patch

    textA = "the cat in the red hat"
    textB = "the feline in the blue hat"

    # Create a diff_match_patch object
    dmp = diff_match_patch.diff_match_patch()

    # Depending on the kind of text you work with, in terms of overall length
    # and complexity, you may want to extend (or here suppress) the
    # time-out feature
    dmp.Diff_Timeout = 0   # or some other value; the default is 1.0 seconds

    # All 'diff' jobs start by invoking diff_main()
    diffs = dmp.diff_main(textA, textB)

    # diff_cleanupSemantic() is used to make the diffs array more "human" readable
    dmp.diff_cleanupSemantic(diffs)

    # And if you want the results as a ready-to-display HTML snippet
    htmlSnippet = dmp.diff_prettyHtml(diffs)
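Regarding the import itself: diff_match_patch.py just needs to be somewhere Python can find it. A minimal sketch, assuming you either drop the file next to your own script or keep it in some library directory (the path below is hypothetical); the diff-match-patch package on PyPI provides the same module if you prefer installing it:

    # Minimal sketch of making diff_match_patch importable.
    # Option 1: place diff_match_patch.py in the same directory as this script;
    #           that directory is already on sys.path.
    # Option 2: append the directory holding the file to sys.path first.
    import sys
    sys.path.append("/path/to/your/libs")  # hypothetical location of diff_match_patch.py

    import diff_match_patch

    dmp = diff_match_patch.diff_match_patch()
    print(dmp.diff_main("hello world", "hello there"))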
A word on "semantic" processing in diff-match-patch
Beware that such processing is mostly useful for presenting differences to a human viewer, because it tends to produce a shorter list of differences by avoiding non-relevant resynchronization of the texts (when, for example, two distinct words happen to have common letters in their middle). The results produced are however far from perfect, as this processing relies on simple heuristics based on the length of the differences, surface patterns, etc., rather than on actual NLP processing based on lexicons and other semantic-level devices.
For example, the textA and textB values used above produce the following "before-and-after-diff_cleanupSemantic" values for the diffs array:

    [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
    [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'red'), (1, 'blue'), (0, ' hat')]
Nice! The letter 'e' common to red and blue causes diff_main() to see this area of the text as four edits, but diff_cleanupSemantic() reduces this to just two edits, nicely singling out the different "sems" 'blue' and 'red'.
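For the record, here is a minimal sketch of how to capture both the "before" and "after" arrays shown above; note that diff_cleanupSemantic() modifies the list in place, so we keep a copy of the raw result first:

    import diff_match_patch

    dmp = diff_match_patch.diff_match_patch()
    dmp.Diff_Timeout = 0

    diffs = dmp.diff_main("the cat in the red hat", "the feline in the blue hat")
    before = list(diffs)             # shallow copy is enough: the tuples are immutable

    dmp.diff_cleanupSemantic(diffs)  # merges the cosmetic micro-edits in place

    print(before)  # the raw, character-level edit list
    print(diffs)   # the semantically cleaned-up edit list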
However, if we have, for example:

    textA = "stackoverflow is cool"
    textB = "so is very cool"

the before/after arrays produced are:
    [(0, 's'), (-1, 'tack'), (0, 'o'), (-1, 'verflow'), (0, ' is'), (1, ' very'), (0, ' cool')]
    [(0, 's'), (-1, 'tackoverflow is'), (1, 'o is very'), (0, ' cool')]
which shows that the allegedly semantically improved "after" can be rather unduly "tortured" compared with the "before". Note, for example, how the leading 's' is kept as a match and how the added 'very' word is mixed with parts of the 'is cool' expression. Ideally, we'd expect something like:
    [(-1, 'stackoverflow'), (1, 'so'), (0, ' is'), (1, ' very'), (0, ' cool')]
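If you actually want a word-level diff closer to that ideal, the diff-match-patch wiki describes a line-mode trick that adapts naturally to words: encode each word as a single placeholder character, diff the encoded strings, then decode. A rough sketch (word_diff() is a hypothetical helper name, and the approach assumes words are separated by single spaces):

    import diff_match_patch

    def word_diff(text1, text2):
        # Approximate a word-level diff by reusing the library's line-mode
        # machinery with one word per "line".
        dmp = diff_match_patch.diff_match_patch()
        a = text1.replace(" ", "\n")
        b = text2.replace(" ", "\n")
        # Each distinct word is encoded as a single placeholder character
        (chars1, chars2, word_array) = dmp.diff_linesToChars(a, b)
        diffs = dmp.diff_main(chars1, chars2, False)   # plain character diff
        dmp.diff_charsToLines(diffs, word_array)       # decode back to words, in place
        return [(op, text.replace("\n", " ")) for (op, text) in diffs]

    print(word_diff("stackoverflow is cool", "so is very cool"))
    # roughly: [(-1, 'stackoverflow '), (1, 'so '), (0, 'is '), (1, 'very '), (0, 'cool')]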