How can I remove junk characters with regex?


I have a web application that reads the contents of a web page and parses the sentences using an NLP algorithm. I have been using a regex to split the contents into single sentences and then parsing them.

I would like to remove certain junk characters from my sentences. These characters, I imagine, are there because of the HTML encoding.

I cannot use a regex like [^\w\d]+ or its variations because I need the punctuation intact. Of course, I could add individual exceptions for each punctuation mark, like [^\w\d\.,:]+ and so on, but is there an easier way to do this, such as a character class that knows what a... funny character is?

Any help is appreciated. Thanks.

Edit: The app is built with PHP, and I am using a simple file_get_contents() to fetch the HTML data from the site and then reading the contents inside the <p> tags.

As was mentioned in the comments by @thegreatco, you should be able to create a character class of "special" characters. You can use hex code values to create a range in a character class. Any special character above ASCII 127 should be covered by this:

[\x80-\xfe] 

That should match the basic non-ASCII characters (bytes 0x80 through 0xFE). For reference's sake, here is a list of the ASCII character table with hex codes.

This page also discusses the different ways you can reference special characters in a regex.
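As a rough illustration of the character class above, here is a minimal PHP sketch. The helper name strip_high_bytes() and the sample string are made up for this example and are not part of the original app. It assumes the text is handled as raw bytes (no /u modifier), so every byte from 0x80 to 0xFE is removed while ASCII punctuation stays intact.

```php
<?php
// Hypothetical helper (not from the original post): removes every byte in
// the range 0x80-0xFE, i.e. everything above 7-bit ASCII except 0xFF.
function strip_high_bytes(string $text): string
{
    // No /u modifier, so the pattern is applied byte by byte.
    $clean = preg_replace('/[\x80-\xFE]+/', '', $text);
    // Collapse any whitespace runs left behind by the removed bytes.
    $clean = preg_replace('/\s{2,}/', ' ', $clean);
    return trim($clean);
}

// Sample input: the stray high bytes are written as explicit escapes.
$dirty = "Hello, world\xE2\x80\x9C! It costs 5:00.";
echo strip_high_bytes($dirty), PHP_EOL;
// prints: Hello, world! It costs 5:00.
```

Note that on UTF-8 input this byte-wise pattern also strips legitimate non-ASCII letters such as é, because every byte of a multibyte UTF-8 sequence falls inside the range. With the /u modifier the same class would instead match the code points U+0080 through U+00FE, so it may be worth checking the page's encoding before choosing either form.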

