Hi guys,
I want to extract a string from a chunk of text (a selection, for example). This text is XML-tagged.
If I use selection[0].contents, it captures the text along with all the special characters (XML tag markers, carriage returns). I can tell something is "wrong" because contents.length is greater than expected: "John Smith" should be 10 characters, but contents.length reports more than 14. I am not really surprised, because I already knew about this behaviour.
So I tried to filter the contents to remove anything that is not an alphanumeric character, and this is where I am stuck.
If I use a regular expression with contents.match(/[\w]+/g), it works almost perfectly. But if the contents include diacritics, this pattern fails to catch them.
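To illustrate the problem, here is a minimal sketch in plain JavaScript. The sample string is made up (a leading U+FEFF standing in for a tag marker and a trailing carriage return); in the real script the value would come from selection[0].contents.

```javascript
// Hypothetical sample: marker character + "José Muñoz" + carriage return.
var sampleContents = "\uFEFFJos\u00E9 Mu\u00F1oz\r";

// \w only covers A-Z, a-z, 0-9 and underscore, so accented letters split the words.
var asciiWords = sampleContents.match(/[\w]+/g);
// asciiWords is ["Jos", "Mu", "oz"]: the é and ñ are dropped.
```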
I could add the accented characters to the pattern myself, but I would very likely miss many of them.
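For reference, this is roughly what "including them in the pattern" could look like. It is only a sketch: the function name extractWords and the sample string are hypothetical, the ranges cover accented Latin letters only (U+00C0 to U+024F), and I am assuming the scripting engine accepts \u escapes inside character classes. Greek, Cyrillic, or combining accents would still be missed, which is exactly my concern.

```javascript
// Hypothetical helper: returns the runs of ASCII word characters
// plus accented Latin letters found in the text.
function extractWords(text) {
    // U+00C0-U+024F covers Latin-1 Supplement letters through Latin Extended-B;
    // U+00D7 (multiplication sign) and U+00F7 (division sign) are skipped on purpose.
    var pattern = /[A-Za-z0-9_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+/g;
    var matches = text.match(pattern);
    return matches ? matches : [];
}

// Made-up sample: marker character + "José Muñoz" + carriage return.
var latinWords = extractWords("\uFEFFJos\u00E9 Mu\u00F1oz\r");
// latinWords is ["José", "Muñoz"], and latinWords.join(" ") has length 10.
```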
So my question is: how can I extract the pure text from the contents, making sure I get all the diacritics if there are any, but without carrying over the special characters?