Our next paper, OpenEDGAR – Open Source Software for SEC Edgar Analysis, is now available. This paper describes a range of #OpenSource tools we have developed for working with the EDGAR system operated by the US Securities and Exchange Commission (SEC). While more sophisticated extraction and clause classification protocols can be developed using LexNLP and other open and closed source tools, we provide some very simple code examples as an illustrative starting point.
Click here for Paper: < SSRN > < arXiv >
Access Codebase Here: < Github >
Abstract: OpenEDGAR is an open source Python framework designed to rapidly construct research databases based on the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system operated by the US Securities and Exchange Commission (SEC). OpenEDGAR is built on the Django application framework, supports distributed compute across one or more servers, and includes functionality to (i) retrieve and parse index and filing data from EDGAR, (ii) build tables for key metadata like form type and filer, (iii) retrieve, parse, and update CIK to ticker and industry mappings, (iv) extract content and metadata from filing documents, and (v) search filing document contents. OpenEDGAR is designed for use in both academic research and industrial applications, and is distributed under MIT License at https://github.com/LexPredict/openedgar
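As a rough illustration of item (i) in the abstract, parsing index data from EDGAR, here is a minimal sketch in plain Python. It is not OpenEDGAR's own code or API: the function name `parse_master_index` and the `FIELDS` names are hypothetical, and the sample rows are made up for the example. It assumes only the pipe-delimited layout of EDGAR's master index files (CIK, company name, form type, date filed, file name).

```python
from collections import Counter
from typing import Dict, Iterator

# Column order of EDGAR's pipe-delimited master index files.
# The key names themselves are our own choice for this sketch.
FIELDS = ("cik", "company_name", "form_type", "date_filed", "file_name")


def parse_master_index(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per filing row in master.idx-style text (hypothetical helper)."""
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        # Skip the preamble, the column header, and the dashed separator:
        # data rows have exactly five fields and a numeric CIK.
        if len(parts) != len(FIELDS) or not parts[0].isdigit():
            continue
        yield dict(zip(FIELDS, parts))


# Made-up sample in the master index layout (these are not real filings).
SAMPLE = """\
Description:           Master Index of EDGAR Dissemination Feed

CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000001|EXAMPLE HOLDINGS INC|10-K|2018-03-01|edgar/data/1000001/0001000001-18-000001.txt
1000002|SAMPLE CAPITAL LLC|8-K|2018-03-02|edgar/data/1000002/0001000002-18-000007.txt
1000001|EXAMPLE HOLDINGS INC|8-K|2018-03-05|edgar/data/1000001/0001000001-18-000002.txt
"""

filings = list(parse_master_index(SAMPLE))

# Tally filings by form type, the kind of metadata table item (ii) refers to.
form_counts = Counter(row["form_type"] for row in filings)
print(form_counts)
```

A real pipeline would download the index files from EDGAR and store the rows in a database; the sketch only shows the parsing step on an in-memory string.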
Paper Abstract – LexNLP is an open source Python package focused on natural language processing and machine learning for legal and regulatory text. The package includes functionality to (i) segment documents, (ii) identify key text such as titles and section headings, (iii) extract over eighteen types of structured information like distances and dates, (iv) extract named entities such as companies and geopolitical entities, (v) transform text into features for model training, and (vi) build unsupervised and supervised models such as word embedding or tagging models. LexNLP includes pre-trained models based on thousands of unit tests drawn from real documents available from the SEC EDGAR database as well as various judicial and regulatory proceedings. LexNLP is designed for use in both academic research and industrial applications, and is distributed at https://github.com/LexPredict/lexpredict-lexnlp
We are announcing a new open source offering, OpenEDGAR, for building databases using the #SEC #EDGAR database. Press release here! See you on GitHub.
On August 1, we released Contrax Suite (an open source document analytics platform). It is important to note that we have decided upon dual licensing: (1) open source (AGPL), which is pretty hard-core copyleft, and (2) a more permissive license in specific circumstances. The key for us is to maintain the open source ecosystem, which requires balancing competing interests. We cannot grant the more permissive license to everyone under all conditions, or it would undermine the entire effort.
That said, we have real problems in the A.I. + Law community. Some of the claims are outlandish and the business model (at its core) does not really make sense. We think that open source helps solve some (perhaps not all) of the adoption issues.
From the release: “At their core, many academic and commercial applications of natural language processing and machine learning can benefit from a controlled lexicon of expert-selected terms (i.e., a dictionary). This is especially true of highly technical language, such as legal text. However, after a search of the existing landscape, we were unable to find a high-quality open source or freely-available legal dictionary. Instead, the best existing versions, when available, exist under some form of restrictive licensing conditions.”
“Thus, in furtherance of both the legal profession as well as a range of legal technology providers and solutions, we are announcing another step in our broader open source plan that we outlined earlier this month. Namely, we are making available on Github the 1910 Version of Black’s Law (i.e., Black’s Law 2nd Edition) as a structured data object. This early version of arguably the premier legal dictionary is made available under the open source GPL license 3.0 which should allow both researchers and commercial providers to operate with limited restrictions.”
Click here to access the GitHub Repo.
From the article – “We are increasingly thinking that there’s room in legal tech for a Red Hat in legal — companies that really focus on development of software by providing wraparound services, but offer their software open source,” Michael J Bommarito II said.
For more information check out our announcement and the slidedeck (which has more details).
Following up on our prior announcement, here is a slide deck with more detail on the product overview, use cases, and plan for release.
Obviously this move is pretty significant for those trying to sell machine learning in a SaaS-style model, i.e., machine learning as a service (MLaaS). Together with the significant amount of ML technology that is already in the open source ecosystem, this will put more pressure on customization and configuration around specific problems, with a much smaller premium on having access to certain forms of base models and algorithms.