How can I remove junk characters with regex?
I have a web application that reads the contents of a web page and parses the sentences using an NLP algorithm. I have been using regex to split the contents into single sentences before parsing them.
I would like to remove characters like Â from the sentences. These characters are there, I imagine, because of HTML encoding.
I cannot use a regex like [^\w\d]+ or its variations because I need the punctuation intact. Of course I could add individual exceptions for each punctuation mark, as in [^\w\d\.,:]+ and so on, but is there an easier way to do this, such as a character class that knows what a... funny character is?
Any help is appreciated. Thanks.
Edit: The app is built in PHP, and I am using a simple file_get_contents() to fetch the HTML data from the site and then reading the contents inside the <p> tags.
As @thegreatco mentioned in the comments, you should be able to create a character class of "special" characters. You can use hex code values to create a range in a character class. Any special character above ASCII 127 should match this:
[\x80-\xfe]
That should match most of the basic special characters. For reference's sake, here's a list of the ASCII character table with hex codes.
This page discusses the different ways you can reference special characters in regex.
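For what it's worth, here is a minimal sketch of how the character class above might be applied in PHP with preg_replace(). The function name strip_high_bytes and the sample sentence are made up for illustration and are not from the original post. It assumes the pattern is used without the /u modifier, so the text is matched byte by byte.

```php
<?php
// Hypothetical helper (not from the original post): removes every byte in the
// range 0x80-0xFE from $text. Without the /u modifier, PCRE matches byte by
// byte, so both bytes of a UTF-8 encoded "Â" (0xC3 0x82) fall in the range.
function strip_high_bytes(string $text): string
{
    // Single quotes keep \x80 and \xFE as regex escapes for PCRE to interpret.
    $result = preg_replace('/[\x80-\xFE]+/', '', $text);

    // preg_replace() returns null on error; fall back to the input unchanged.
    return $result ?? $text;
}

// Sample input is an assumption; the junk character is written as raw bytes
// so the example does not depend on this file's own encoding.
$sentence = "It works.\xC3\x82 Really, it does: yes!";
echo strip_high_bytes($sentence), PHP_EOL;
```

Punctuation such as the period, comma, colon and exclamation mark is plain ASCII, so it is left untouched. Keep in mind that in UTF-8 text this byte range also removes legitimate accented letters such as é, since they are encoded with bytes above 127 as well.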