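import bs4
import requests

# Fetch two Wikipedia pages and parse each response into a soup.
soups = [
    bs4.BeautifulSoup(r.content, 'lxml')
    for r in map(requests.get, [
        'https://en.wikipedia.org/wiki/Syllabification',
        'https://en.wikipedia.org/wiki/Hyphen'
    ])
]
# Count the words across both pages (analyze_soups is documented below).
c = analyze_soups(soups)
03_html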
HTML modification
prerequisites
hyphenate a beautiful soup
hyphenate_soup
hyphenate_soup (soup:bs4.BeautifulSoup, hyphenator:collections.abc.Callable[[str],str], exclude_classes:tuple[typing.Type[bs4.element.PageElement],...]=(<class 'bs4.element.PreformattedString'>, <class 'bs4.element.Stylesheet'>, <class 'bs4.element.Script'>, <class 'bs4.element.RubyTextString'>, <class 'bs4.element.RubyParenthesisString'>))
Call hyphenator on the words in every suitable text element of soup and replace the contents of that element with the result. A text element is suitable unless its class is one of exclude_classes or a subclass of one of them. The soup is modified in place and nothing is returned.
| | Type | Default | Details |
|---|---|---|---|
| soup | BeautifulSoup | | soup to be modified |
| hyphenator | Callable | | hyphenator |
| exclude_classes | tuple | (<class 'bs4.element.PreformattedString'>, <class 'bs4.element.Stylesheet'>, <class 'bs4.element.Script'>, <class 'bs4.element.RubyTextString'>, <class 'bs4.element.RubyParenthesisString'>) | do not modify inside these |
| Returns | None | | |
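To make the description above concrete, here is a minimal sketch of how such an in-place rewrite could work. It is not the library's implementation: `hyphenate_soup_sketch`, the shortened `EXCLUDE` tuple, the `WORD` pattern, and the lookup-table hyphenator are all assumptions made for illustration.

```python
import re

import bs4
from bs4.element import NavigableString, PreformattedString, Script, Stylesheet

# Assumption: a shorter exclusion tuple than the documented default.
EXCLUDE = (PreformattedString, Stylesheet, Script)

# Assumption: a "word" is a run of letters; the real tokenizer may differ.
WORD = re.compile(r"[^\W\d_]+")


def hyphenate_soup_sketch(soup, hyphenator, exclude_classes=EXCLUDE):
    """Replace text nodes in place, applying hyphenator to each word."""
    # find_all returns a list, so replacing nodes while looping is safe.
    for node in soup.find_all(string=True):
        if isinstance(node, exclude_classes):
            continue  # leave scripts, stylesheets, comments, etc. untouched
        old = str(node)
        new = WORD.sub(lambda m: hyphenator(m.group()), old)
        if new != old:
            node.replace_with(NavigableString(new))


# Hypothetical toy hyphenator: a lookup table of soft-hyphenated words.
SHY = "\u00ad"
table = {"hyphenation": f"hy{SHY}phen{SHY}a{SHY}tion"}

soup = bs4.BeautifulSoup(
    "<p>hyphenation</p><script>var hyphenation;</script>", "html.parser"
)
hyphenate_soup_sketch(soup, lambda w: table.get(w, w))
```

After the call, the paragraph text contains soft hyphens while the script body is unchanged, because the parser stores script contents as `bs4.element.Script` strings and the `isinstance` check skips them.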
soup analysis
The following function consumes many beautiful soups and counts the words in them. Ways to use it include:
- list all the words longer than a threshold and see if they are compounds
- list all the frequent words and see if they are hyphenated right
- list all the words of low frequency and see if they are typos
analyze_soups
analyze_soups (dinner:list[bs4.BeautifulSoup], exclude_classes:tuple[typing.Type[bs4.element.PageElement],...]=(<class 'bs4.element.PreformattedString'>, <class 'bs4.element.Stylesheet'>, <class 'bs4.element.Script'>, <class 'bs4.element.RubyTextString'>, <class 'bs4.element.RubyParenthesisString'>))
Count words appearing in all soups.
| | Type | Default | Details |
|---|---|---|---|
| dinner | list | | soups to be read |
| exclude_classes | tuple | (<class 'bs4.element.PreformattedString'>, <class 'bs4.element.Stylesheet'>, <class 'bs4.element.Script'>, <class 'bs4.element.RubyTextString'>, <class 'bs4.element.RubyParenthesisString'>) | do not look inside these |
| Returns | Counter | | |
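A rough sketch of the counting idea is shown below, under stated assumptions: `analyze_soups_sketch` is a hypothetical stand-in rather than the real function, and the letter-run tokenizer, the lowercasing, and the shortened exclusion tuple are guesses that may not match the actual behavior.

```python
import re
from collections import Counter

import bs4
from bs4.element import PreformattedString, Script, Stylesheet

# Assumption: a shorter exclusion tuple than the documented default.
EXCLUDE = (PreformattedString, Stylesheet, Script)

# Assumption: words are runs of letters, counted case-insensitively.
WORD = re.compile(r"[^\W\d_]+")


def analyze_soups_sketch(dinner, exclude_classes=EXCLUDE):
    """Count words in the text nodes of every soup in dinner."""
    counts = Counter()
    for soup in dinner:
        for node in soup.find_all(string=True):
            if isinstance(node, exclude_classes):
                continue  # do not look inside scripts, stylesheets, comments
            counts.update(w.lower() for w in WORD.findall(str(node)))
    return counts


pages = [
    bs4.BeautifulSoup("<p>A hyphen joins words.</p>", "html.parser"),
    bs4.BeautifulSoup(
        "<p>Hyphen rules</p><style>p { color: red }</style>", "html.parser"
    ),
]
counts = analyze_soups_sketch(pages)
```

Here "hyphen" is counted once per page, and nothing inside the `<style>` element contributes to the result.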
It can be interesting to look at the longest words on some pages, perhaps to check if hyphenation exceptions are needed for them:
[(w, c[w]) for w in c.keys() if len(w)>13]
[('syllabification', 19),
('conventionally', 2),
('correspondence', 1),
('implementation', 1),
('heterosyllabic', 1),
('disambiguation', 2),
('prescriptivist', 1),
('recommendations', 2),
('representatives', 1),
('misinterpretable', 1),
('misunderstanding', 1),
('misinterpreted', 2),
('past_participled', 1),
('differentiated', 1),
('recommendation', 1),
('sociolinguistics', 2),
('reduplicatives', 1),
('implementations', 1),
('indistinguishable', 1),
('standardization', 1),
('reinterpretations', 1),
('interpretation', 1),
('capitalization', 1),
('classifications', 1),
('microtypography', 1),
('phototypesetting', 1),
('classification', 1),
('multiplication', 1),
('srpskohrvatski', 1),
('српскохрватски', 1)]
Or the most common words:
c.most_common(20)
[('the', 339),
('a', 215),
('of', 190),
('in', 174),
('and', 159),
('is', 135),
('to', 132),
('hyphen', 131),
('for', 84),
('as', 79),
('or', 75),
('be', 65),
('are', 61),
('this', 57),
('that', 56),
('with', 55),
('used', 53),
('hyphenation', 46),
('hyphens', 45),
('word', 41)]
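The remaining use from the list earlier, checking low-frequency words for typos, could be sketched as follows. The counter here is a small made-up stand-in for the result of analyze_soups (the misspelling "hypenation" is invented for illustration), and `rare_words` is a hypothetical helper, not part of the module.

```python
from collections import Counter

# Made-up counts standing in for the Counter returned by analyze_soups.
c = Counter({
    "the": 339,
    "hyphen": 131,
    "hyphenation": 46,
    "hypenation": 1,      # invented misspelling, for illustration only
    "srpskohrvatski": 1,
})


def rare_words(counter, max_count=1):
    """Return words seen at most max_count times, in alphabetical order."""
    return sorted(w for w, n in counter.items() if n <= max_count)


rare = rare_words(c)
print(rare)
```

Rare words are only candidates: a legitimate but unusual word such as 'srpskohrvatski' shows up alongside genuine typos, so the list still needs a manual pass.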