03_html

HTML modification

prerequisites

hyphenate a beautiful soup

hyphenate_soup

 hyphenate_soup (soup:bs4.BeautifulSoup,
                 hyphenator:collections.abc.Callable[[str],str],
                 exclude_classes:tuple[typing.Type[bs4.element.PageElement],...]=(
                     <class 'bs4.element.PreformattedString'>,
                     <class 'bs4.element.Stylesheet'>,
                     <class 'bs4.element.Script'>,
                     <class 'bs4.element.RubyTextString'>,
                     <class 'bs4.element.RubyParenthesisString'>))

Call hyphenator on each word that appears in a suitable element of soup, and replace the contents of that element with the result. An element is suitable if it contains text and its class is neither listed in exclude_classes nor a subclass of a class listed there. The soup is modified in place and nothing is returned.

Parameter	Type	Default	Details
soup	BeautifulSoup		soup to be modified
hyphenator	Callable		called on each word (str to str)
exclude_classes	tuple	(<class 'bs4.element.PreformattedString'>, <class 'bs4.element.Stylesheet'>, <class 'bs4.element.Script'>, <class 'bs4.element.RubyTextString'>, <class 'bs4.element.RubyParenthesisString'>)	do not modify inside these
Returns	None
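
The behaviour described above can be illustrated with a small stand-alone sketch. It is not the library's implementation: the name hyphenate_soup_sketch, the shortened EXCLUDE tuple, the toy_hyphenator helper and the choice to treat a run of \w characters as a word are all assumptions made for this example.

```python
import re

import bs4
from bs4.element import NavigableString, PreformattedString, Script, Stylesheet

# Assumption: a shortened stand-in for the default exclude_classes above.
EXCLUDE = (PreformattedString, Stylesheet, Script)


def hyphenate_soup_sketch(soup, hyphenator, exclude_classes=EXCLUDE):
    """Illustrative only: rewrite words in suitable text nodes, in place."""
    # Copy the node list first, since replace_with changes the tree.
    for node in list(soup.find_all(string=True)):
        if isinstance(node, exclude_classes):
            continue  # comments, scripts, stylesheets and the like
        text = str(node)
        # Assumption: a "word" is a run of \w characters.
        new_text = re.sub(r"\w+", lambda m: hyphenator(m.group()), text)
        if new_text != text:
            node.replace_with(NavigableString(new_text))


def toy_hyphenator(word):
    """Hypothetical hyphenator: one soft hyphen after the fourth letter of long words."""
    return word if len(word) < 8 else word[:4] + "\u00ad" + word[4:]


soup = bs4.BeautifulSoup(
    "<p>syllabification rules</p><script>var syllabification = 1;</script>",
    "html.parser",
)
hyphenate_soup_sketch(soup, toy_hyphenator)
print(repr(soup.p.string))       # the paragraph text now contains a soft hyphen
print(repr(soup.script.string))  # the script text is left untouched
```

The isinstance check against a tuple of classes is what makes subclasses count as excluded: a Comment, for example, is a PreformattedString and is therefore skipped.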

soup analysis

The following function consumes many beautiful soups and counts the words in them. Ways to use it include:

- list all the words longer than a threshold and see if they are compounds
- list all the frequent words and see if they are hyphenated correctly
- list all the low-frequency words and see if they are typos

analyze_soups

 analyze_soups (dinner:list[bs4.BeautifulSoup],
                exclude_classes:tuple[typing.Type[bs4.element.PageElement],...]=(
                    <class 'bs4.element.PreformattedString'>,
                    <class 'bs4.element.Stylesheet'>,
                    <class 'bs4.element.Script'>,
                    <class 'bs4.element.RubyTextString'>,
                    <class 'bs4.element.RubyParenthesisString'>))

Count the occurrences of each word across all the given soups.

Parameter	Type	Default	Details
dinner	list		soups to be read
exclude_classes	tuple	(<class 'bs4.element.PreformattedString'>, <class 'bs4.element.Stylesheet'>, <class 'bs4.element.Script'>, <class 'bs4.element.RubyTextString'>, <class 'bs4.element.RubyParenthesisString'>)	do not look inside these
Returns	Counter
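
As a rough illustration of the counting step, the sketch below tallies words over several soups while skipping excluded string classes. It is not the library's code: the name analyze_soups_sketch, the shortened exclusion tuple, lower-casing, and the \w+ word pattern are assumptions chosen for this example.

```python
import re
from collections import Counter

import bs4
from bs4.element import PreformattedString, Script, Stylesheet


def analyze_soups_sketch(dinner, exclude_classes=(PreformattedString, Stylesheet, Script)):
    """Illustrative only: count lower-cased words over all soups in dinner."""
    counts = Counter()
    for soup in dinner:
        for node in soup.find_all(string=True):
            if isinstance(node, exclude_classes):
                continue  # do not look inside comments, scripts, stylesheets
            # Assumption: a "word" is a run of \w characters, compared case-insensitively.
            counts.update(w.lower() for w in re.findall(r"\w+", node))
    return counts


toy_soups = [
    bs4.BeautifulSoup("<p>The cat</p><style>p{color:red}</style>", "html.parser"),
    bs4.BeautifulSoup("<p>the dog</p>", "html.parser"),
]
toy_counts = analyze_soups_sketch(toy_soups)
print(toy_counts)
```

Returning a Counter keeps follow-up queries short, since methods such as most_common are available directly on the result.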

It can be interesting to look at the longest words on some pages, perhaps to check if hyphenation exceptions are needed for them:

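import bs4
import requests

# Fetch two Wikipedia pages and parse each into a soup (the 'lxml' parser must be installed).
soups = [
    bs4.BeautifulSoup(r.content, 'lxml')
    for r in map(requests.get, [
        'https://en.wikipedia.org/wiki/Syllabification',
        'https://en.wikipedia.org/wiki/Hyphen'
    ])
]
c = analyze_soups(soups)

# Words longer than 13 characters, with their counts.
[(w, n) for w, n in c.items() if len(w) > 13]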

[('syllabification', 19),
 ('conventionally', 2),
 ('correspondence', 1),
 ('implementation', 1),
 ('heterosyllabic', 1),
 ('disambiguation', 2),
 ('prescriptivist', 1),
 ('recommendations', 2),
 ('representatives', 1),
 ('misinterpretable', 1),
 ('misunderstanding', 1),
 ('misinterpreted', 2),
 ('past_participled', 1),
 ('differentiated', 1),
 ('recommendation', 1),
 ('sociolinguistics', 2),
 ('reduplicatives', 1),
 ('implementations', 1),
 ('indistinguishable', 1),
 ('standardization', 1),
 ('reinterpretations', 1),
 ('interpretation', 1),
 ('capitalization', 1),
 ('classifications', 1),
 ('microtypography', 1),
 ('phototypesetting', 1),
 ('classification', 1),
 ('multiplication', 1),
 ('srpskohrvatski', 1),
 ('српскохрватски', 1)]

Or the most common words:

c.most_common(20)

[('the', 339),
 ('a', 215),
 ('of', 190),
 ('in', 174),
 ('and', 159),
 ('is', 135),
 ('to', 132),
 ('hyphen', 131),
 ('for', 84),
 ('as', 79),
 ('or', 75),
 ('be', 65),
 ('are', 61),
 ('this', 57),
 ('that', 56),
 ('with', 55),
 ('used', 53),
 ('hyphenation', 46),
 ('hyphens', 45),
 ('word', 41)]
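
The remaining use, scanning low-frequency words for possible typos, follows the same pattern. The sketch below uses a small hand-made Counter (toy data, not output from the pages above) so that it runs on its own; with real data, the Counter returned by analyze_soups would take its place.

```python
from collections import Counter

# Toy data standing in for the result of analyze_soups; the misspellings are invented.
toy = Counter({"the": 5, "hyphen": 3, "hyphne": 1, "word": 2, "syllabel": 1})

# Words seen exactly once are candidates for a manual typo check.
rare = sorted(w for w, n in toy.items() if n == 1)
print(rare)
```

A word that occurs only once is not necessarily a typo, so the list is a starting point for review rather than a verdict.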