python - BeautifulSoup find element by text using `find_all` no matter if there are elements in it
For example,

    bs = BeautifulSoup("<html><a>sometext</a></html>")
    print bs.find_all("a", text=re.compile(r"some"))

returns [<a>sometext</a>], but when the element searched has a child, e.g. an img,

    bs = BeautifulSoup("<html><a>sometext<img /></a></html>")
    print bs.find_all("a", text=re.compile(r"some"))

it returns [].

Is there a way to use find_all to match the latter example?
You need to use a hybrid approach, since text= will fail when an element has child elements as well as text.
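
    import re
    from bs4 import BeautifulSoup

    bs = BeautifulSoup("<html><a>sometext</a></html>")
    reg = re.compile(r'some')
    # reg.match only matches at the start of the text; use reg.search to match anywhere
    elements = [e for e in bs.find_all('a') if reg.match(e.text)]

Background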
When BeautifulSoup is searching for an element, and text is a callable, it eventually calls:

    self._matches(found.string, self.text)

In the two examples you gave, the .string property returns different things:

    >>> bs1 = BeautifulSoup("<html><a>sometext</a></html>")
    >>> bs1.find('a').string
    u'sometext'
    >>> bs2 = BeautifulSoup("<html><a>sometext<img /></a></html>")
    >>> bs2.find('a').string
    >>> print bs2.find('a').string
    None

The .string property looks like this:

    @property
    def string(self):
        """Convenience property to get the single string within this tag.

        :Return: If this tag has a single string child, return value
         is that string. If this tag has no children, or more than one
         child, return value is None. If this tag has one child tag,
         return value is the 'string' attribute of the child tag,
         recursively.
        """
        if len(self.contents) != 1:
            return None
        child = self.contents[0]
        if isinstance(child, NavigableString):
            return child
        return child.string

If you print out the contents, you can see why this returns None: the second tag has two children, so the length check fails.

    >>> print bs1.find('a').contents
    [u'sometext']
    >>> print bs2.find('a').contents
    [u'sometext', <img/>]
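
If you would rather keep everything inside a single find_all call, one option is to pass a function instead of text=. The sketch below is written for Python 3 with bs4 and assumes the built-in "html.parser" backend. The helper name find_all_by_text is made up for illustration and is not part of BeautifulSoup. It tests the pattern against get_text(), which joins all strings inside the tag, so a child such as an img no longer hides the text.

```python
import re
from bs4 import BeautifulSoup


def find_all_by_text(soup, name, pattern):
    """Return tags called `name` whose combined text matches `pattern`.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    regex = re.compile(pattern)

    def matches(tag):
        # get_text() concatenates every string inside the tag, including
        # strings in nested tags, so it does not depend on .string.
        return tag.name == name and regex.search(tag.get_text()) is not None

    # find_all accepts a function that takes a tag and returns True or False.
    return soup.find_all(matches)


soup = BeautifulSoup(
    "<html><a>sometext<img /></a><a>other</a></html>", "html.parser"
)
found = find_all_by_text(soup, "a", r"some")
print(found)
```

Note that this uses regex.search, which matches anywhere in the text, whereas the list comprehension above uses match, which only matches at the start. Also keep in mind that get_text() includes text from nested tags, so an outer element can match because of text that belongs to one of its descendants.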