- Python 3.4 support is dropped;
- in case of duplicate OpenGraph definitions (e.g. multiple og:image), empty results are de-prioritized now, to do the same as Facebook;
- text content of microdata attributes is now extracted using html-text library, which fixes badly extracted text in some cases (words glued together, etc.)
- In case of duplicate OpenGraph definitions (e.g. multiple
og:image), extruct now keeps the first one, not the last one, to do the same as Facebook.
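Taken together, the two duplicate-handling rules in this changelog (keep the first definition, and de-prioritize empty ones) can be sketched as follows. This is only an illustration of the selection rule, not extruct's actual code; `pick_value` is a hypothetical helper name.

```python
def pick_value(candidates):
    """Return the first non-empty candidate, falling back to the first one."""
    for value in candidates:
        # Skip empty or whitespace-only values while a better one may follow.
        if value and value.strip():
            return value
    # Every candidate was empty: keep the first one, if there is any.
    return candidates[0] if candidates else None


# Hypothetical duplicate og:image values found on one page.
images = ["", "http://example.com/a.png", "http://example.com/b.png"]
chosen = pick_value(images)
```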
- Cover all possible exception cases handled by the `extruct()` `errors` attribute for the values `strict`, `log` and `ignore`
- avoid including `itemprop` from child `itemscope` when using `itemref` for microdata
- proper processing order for `itemref` for microdata
- json-ld parsing issue is fixed;
- deprecation warning for `url` argument points to caller code;
- better Python 3.7 support (fixed warnings, setup running 3.7 tests on CI).
In this release OpenGraph parsing is improved:
- known OpenGraph namespaces (og, music, video, article, book, profile) work without an explicitly defined prefix;
- prefix is extracted both from `<head>` and `<html>` element attributes, not only from `<head>`;
- prefix parsing is more permissive.
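As a rough illustration of what collecting prefixes from two elements might involve, the sketch below merges `prefix` attribute values into one mapping and tolerates a missing space after the colon. The function name, the regular expression and the sample attribute values are assumptions made for this example; extruct's real parser may behave differently.

```python
import re

# "name: URI" pairs; whitespace after the colon is optional in this sketch.
PREFIX_RE = re.compile(r"([A-Za-z_][\w-]*):\s*(\S+)")


def parse_prefixes(*attr_values):
    """Merge prefix declarations from several attribute values into a dict."""
    prefixes = {}
    for value in attr_values:
        for name, uri in PREFIX_RE.findall(value or ""):
            # The first declaration of a prefix wins.
            prefixes.setdefault(name, uri)
    return prefixes


# Hypothetical attribute values from <html> and <head>.
html_prefix = "og: http://ogp.me/ns#"
head_prefix = "fb:http://ogp.me/ns/fb#  article: http://ogp.me/ns/article#"
found = parse_prefixes(html_prefix, head_prefix)
```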
Other changes:
- pypi version badge is added to the README;
- html parsing code is cleaned up.
- JSON-LD parsing is less strict now: control characters are allowed.
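For context, Python's standard `json` module rejects raw control characters inside strings unless `strict=False` is passed. The sketch below shows that difference on a made-up payload; whether extruct relies on exactly this option is an assumption here.

```python
import json

# A JSON-LD-like payload whose string value contains a raw newline,
# which is a control character and invalid in strict JSON.
payload = '{"@type": "Article", "headline": "Line one\nLine two"}'

try:
    json.loads(payload)  # strict=True is the default
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# With strict=False, control characters inside strings are accepted.
data = json.loads(payload, strict=False)
```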
- Add OpenGraph and Microformat extractors.
- Add argument `syntaxes` to `extract` and command line function; it selects which syntaxes to extract.
- Add argument `uniform` to `extract` and command line function; if True, it maps the output of Microdata, OpenGraph, Microformat and JSON-LD to the same template.
- Add argument `errors` to `extract` and command line function; it defines whether errors should be raised, logged or ignored.
- Fix RDFa memory leak: `RDFaExtractor` now resets `_lookups` after each extraction.
- Fixed regex pattern in `JsonLdExtractor` to avoid removing comments from within valid JSON.
- In `w3microdata`, strip whitespace, newlines, etc. from URLs extracted from HTML nodes.
- `base_url` substitutes `url` in `MicroformatExtractor`, `JsonLdExtractor`, `OpenGraphExtractor`, `RDFaExtractor` and `MicrodataExtractor`
- individual extractors accept `base_url` instead of `url`; unused keyword arguments are removed.
- In `w3microdata.extract_items`, `items_seen` and `url` are no longer class variables but are passed as arguments.
- In `w3microdata` the following functions are now private: `extract_item`, `extract_property_value`, `extract_textContent`, `_extract_property`, `_extract_properties`, `_extract_property_refs` and `_extract_textContent`.
- In `w3microdata`, `_extract_properties`, `_extract_property_refs`, `_extract_property`, `_extract_property_value` and `_extract_item` now need `items_seen` and `url` to be passed as arguments.
- Add argument `return_html_node` to `extract`; it allows returning the HTML node together with the result of metadata extraction. It is supported only by the microdata syntax.
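The raise/log/ignore behaviour of an errors option can be sketched generically as below. This is not extruct's implementation: `run_extractor` and `broken_extractor` are hypothetical names, and the mode names `strict`, `log` and `ignore` are taken from another entry in this changelog.

```python
import logging

logger = logging.getLogger("extraction_sketch")


def run_extractor(extractor, document, errors="strict"):
    """Call extractor(document), handling failures according to errors."""
    if errors not in ("strict", "log", "ignore"):
        raise ValueError("errors must be 'strict', 'log' or 'ignore'")
    try:
        return extractor(document)
    except Exception:
        if errors == "strict":
            raise  # propagate to the caller
        if errors == "log":
            logger.exception("extraction failed")
        return []  # 'log' and 'ignore' both yield an empty result


def broken_extractor(document):
    # Stand-in for an extractor that fails on bad markup.
    raise RuntimeError("malformed markup")


ignored = run_extractor(broken_extractor, "<html></html>", errors="ignore")
```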
Warning: backward-incompatible change:
- `base_url` is used instead of `url` in `extruct.extract`; `url` is still supported but deprecated.
- In `extruct.extract` the default `base_url` is now `None` to avoid wrong results with `urljoin`.
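To see why the base URL matters, the standard-library `urljoin` resolves relative references against whatever base it is given and leaves them untouched when the base is empty. The URLs below are made up for the example.

```python
from urllib.parse import urljoin

relative = "images/cover.png"

# Resolved against the page the markup came from.
resolved = urljoin("http://example.com/books/page.html", relative)

# With an empty base, the relative reference is returned unchanged.
unresolved = urljoin("", relative)

# Absolute URLs are not affected by the base.
absolute = urljoin("http://example.com/books/page.html",
                   "http://cdn.example.org/x.png")
```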
- New `extruct` command line tool to fetch a page and extract its metadata. Works either via `extruct` directly or `python -m extruct`.
- Accept leading HTML comment in JSON-LD payload.
- rdflib log messages were silenced to avoid the noise when importing extruct.
- Fix dependencies and support RDFa by default (hence depend on rdflib by default).
- Update README with all-in-one extractor examples.
- All extractors have an `.extract_items()` method, taking an lxml-parsed document as input, if you want to reuse one you already have.
- Add generic extraction: use `extruct.extract()` to call all extractors at once.
Warning: backward-incompatible change:
`.extract()` methods now return a list of Python dicts (the items) instead of a dict with an "items" key having this list as value.
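Code that must cope with both the old and the new return shape could normalise the result with a small shim such as the one below. `as_item_list` is a hypothetical helper written for this note, not part of extruct, and the sample items are invented.

```python
def as_item_list(result):
    """Return the list of items from either the old or the new shape."""
    # Old shape: {"items": [...]}
    if isinstance(result, dict) and "items" in result:
        return result["items"]
    # New shape: the list of item dicts itself.
    return list(result)


old_style = {"items": [{"type": "http://schema.org/Book"}]}
new_style = [{"type": "http://schema.org/Book"}]
same = as_item_list(old_style) == as_item_list(new_style)
```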
- Use rdflib's pyRdfa directly instead of pyRdfa3 code copy.
- (Very) Experimental support for RDFa extraction using rdflib+lxml
- Web service response content-type set to 'application/json'
- Web service Python 3 compatibility
- Code coverage reports
- Fix extraction of `<object>` "data" URL with microdata
- Handle textContent mixed with `<script>` and `<style>` tags
- Add JSON-LD extraction example to README
- Tests added for non-nested microdata output
- Tests added for text content option
- Tests added for "meter" and "data" attributes
- First release on PyPI.