dkpro / dkpro-cassis
UIMA CAS processing library written in Python
Repository Overview (README excerpt)
dkpro-cassis
============

.. image:: https://github.com/dkpro/dkpro-cassis/actions/workflows/run_tests.yml/badge.svg
  :target: https://github.com/dkpro/dkpro-cassis/actions/workflows/run_tests.yml

.. image:: https://readthedocs.org/projects/cassis/badge/?version=latest
  :target: https://cassis.readthedocs.io/en/latest/?badge=latest
  :alt: Documentation Status

.. image:: https://codecov.io/gh/dkpro/dkpro-cassis/branch/master/graph/badge.svg
  :target: https://codecov.io/gh/dkpro/dkpro-cassis

.. image:: https://img.shields.io/pypi/l/dkpro-cassis.svg
  :alt: PyPI - License
  :target: https://pypi.org/project/dkpro-cassis/

.. image:: https://img.shields.io/pypi/pyversions/dkpro-cassis.svg
  :alt: PyPI - Python Version
  :target: https://pypi.org/project/dkpro-cassis/

.. image:: https://img.shields.io/pypi/v/dkpro-cassis.svg
  :alt: PyPI
  :target: https://pypi.org/project/dkpro-cassis/

.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
  :target: https://github.com/ambv/black

DKPro **cassis** (pronunciation: [ka.sis]) provides a pure-Python implementation of the *Common Analysis System* (CAS)
as defined by the UIMA framework. The CAS is a data structure representing an object to be enriched with annotations
(the so-called *Subject of Analysis*, or *SofA* for short).

This library enables the creation and manipulation of annotated documents (CAS objects) and their associated type
systems, as well as loading and saving them in the UIMA CAS XMI format or the UIMA JSON CAS format in Python programs.
In particular, this can ease the integration of Python-based Natural Language Processing (e.g. _ or _) and
Machine Learning libraries (e.g. _ or _) in UIMA-based text analysis workflows.

An example of cassis in action is the _, which wraps the spacy NLP library as a web service that can be used in
conjunction with the _ text annotation platform to automatically generate annotation suggestions.
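To make the idea of a SofA enriched with stand-off annotations more concrete, here is a minimal, library-independent sketch in plain Python. The names ``Annotation``, ``Document`` and ``covered_text`` are hypothetical and chosen for illustration only; they are not part of the cassis API, which provides richer ``Cas`` and type system objects as shown in the usage section.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """Hypothetical stand-off annotation: character offsets plus a label."""
    begin: int
    end: int
    label: str


@dataclass
class Document:
    """Hypothetical stand-in for a CAS: the text (SofA) and its annotations."""
    sofa_string: str
    annotations: List[Annotation] = field(default_factory=list)

    def add(self, annotation: Annotation) -> None:
        # Annotations only reference the text by offset; the text is never modified.
        self.annotations.append(annotation)

    def covered_text(self, annotation: Annotation) -> str:
        # Recover the annotated span by slicing the SofA string.
        return self.sofa_string[annotation.begin:annotation.end]


doc = Document("Joe waited for the train .")
doc.add(Annotation(begin=0, end=3, label="NNP"))
doc.add(Annotation(begin=4, end=10, label="VBD"))

covered = [doc.covered_text(a) for a in doc.annotations]
print(covered)
```

The point of the sketch is that annotations live next to the text and point into it by ``begin``/``end`` offsets, which is the same convention the ``Token`` examples in the usage section rely on.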
Features
--------

Currently supported features are:

- Text SofAs
- Deserializing/serializing UIMA CAS from/to XMI
- Deserializing/serializing UIMA CAS from/to JSON
- Deserializing/serializing type systems from/to XML
- Selecting annotations, selecting covered annotations, adding annotations
- Type inheritance
- Multiple SofA support
- Type system can be changed after loading
- Primitive and reference features and arrays of primitives and references

Some features are still under development, e.g.

- Proper type checking
- XML/XMI schema validation

Installation
------------

To install the package with :code:`pip`, just run::

    pip install dkpro-cassis

Usage
-----

Example CAS XMI and type system files can be found under :code: .

Reading a CAS file
~~~~~~~~~~~~~~~~~~

**From XMI:** A CAS can be deserialized from the UIMA CAS XMI (XML 1.0) format, either from a file or from a string,
using :code:`load_cas_from_xmi`.

.. code:: python

    from cassis import *

    with open('typesystem.xml', 'rb') as f:
        typesystem = load_typesystem(f)

    with open('cas.xmi', 'rb') as f:
        cas = load_cas_from_xmi(f, typesystem=typesystem)

**From JSON:** The UIMA JSON CAS format is also supported and can be loaded using :code:`load_cas_from_json`.
Most UIMA JSON CAS files come with an embedded type system, so it is not necessary to specify one.

.. code:: python

    from cassis import *

    with open('cas.json', 'rb') as f:
        cas = load_cas_from_json(f)

Writing a CAS file
~~~~~~~~~~~~~~~~~~

**To XMI:** A CAS can be serialized to XMI using :code:`to_xmi`, which either writes to a file or returns the
result as a string.

.. code:: python

    from cassis import *

    # Returned as a string
    xmi = cas.to_xmi()

    # Written to file
    cas.to_xmi("my_cas.xmi")

**To JSON:** A CAS can also be written to JSON using :code:`to_json`.

.. code:: python

    from cassis import *

    # Returned as a string
    json_string = cas.to_json()

    # Written to file
    cas.to_json("my_cas.json")

Creating a CAS
~~~~~~~~~~~~~~

A CAS (Common Analysis System) object typically represents a (text) document.
When using cassis, you will most often read existing CAS files, modify them, and then write them out again.
But you can also create CAS objects from scratch, e.g. if you want to convert some data into a CAS object in order
to create a pre-annotated text. If you do not have a pre-defined type system to work with, you will have to define one.

.. code:: python

    from cassis import *

    typesystem = TypeSystem()

    cas = Cas(
        sofa_string = "Joe waited for the train . The train was late .",
        document_language = "en",
        typesystem = typesystem)

    print(cas.sofa_string)
    print(cas.sofa_mime)
    print(cas.document_language)

Adding annotations
~~~~~~~~~~~~~~~~~~

**Note:** The type names used below are examples only. The actual CAS files you will be dealing with will use
other names! You can get a list of the types using :code: .

Given a type system with a type :code:`cassis.Token` that has an :code:`id` and a :code:`pos` feature,
annotations can be added as follows:

.. code:: python

    from cassis import *

    with open('typesystem.xml', 'rb') as f:
        typesystem = load_typesystem(f)

    with open('cas.xmi', 'rb') as f:
        cas = load_cas_from_xmi(f, typesystem=typesystem)

    Token = typesystem.get_type('cassis.Token')

    tokens = [
        Token(begin=0, end=3, id='0', pos='NNP'),
        Token(begin=4, end=10, id='1', pos='VBD'),
        Token(begin=11, end=14, id='2', pos='IN'),
        Token(begin=15, end=18, id='3', pos='DT'),
        Token(begin=19, end=24, id='4', pos='NN'),
        Token(begin=25, end=26, id='5', pos='.'),
    ]

    for token in tokens:
        cas.add(token)

Selecting annotations
~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    from cassis import *

    with open('typesystem.xml', 'rb') as f:
        typesystem = load_typesystem(f)

    with open('cas.xmi', 'rb') as f:
        cas = load_cas_from_xmi(f, typesystem=typesystem)

    for sentence in cas.select('cassis.Sentence'):
        for token in cas.select_covered('cassis.Token', sentence):
            print(token.get_covered_text())

            # Annotation values can be accessed as properties
            print('Token: begin={0}, end={1}, id={2}, pos={3}'.format(token.begin, token.end, token.id, token.pos))

Getting and setting (nested) features
~~~~~~~~~~~~~~~~~~~~~~~~~…