v1.1.5
Version 1.1.5 internal test release
- Testing .docx and .pdf text extraction
Following a conversation about StudentTasks, a technology test was completed to create a service that asynchronously retrieves a base64-encoded document from a FileMaker record and extracts its text. The service is triggered after a submit action finalises a student task submission. There is no UI requirement, so status information is key to confirming completion, failure or stalled extractions. Parameters are {taskID, fileType}. Sample result:
{
"text": "extracted plain text content",
"charCount": 17961,
"durationMs": 209,
"library": "mammoth",
"version": "1.8.0",
"filename": "report.docx",
"lineBreaks": "single",
"hash": "c99b32954d120bf62ad818944660275f"
}
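Because there is no UI, a caller may want to check the shape of this result before writing it back to FileMaker. The TypeScript sketch below is one way to do that. The names ExtractionResult and isExtractionResult are illustrative and not part of the service, the field list is taken only from the sample above, and which fields are optional is an assumption.

```typescript
// Shape of the extraction result, inferred from the sample response only.
interface ExtractionResult {
  text: string;
  charCount: number;
  durationMs: number;
  library: string;
  version: string;
  filename: string;
  lineBreaks?: string; // assumed optional
  hash?: string;       // assumed optional
}

// Hypothetical type guard: true when `value` carries the required fields
// with the expected primitive types.
function isExtractionResult(value: unknown): value is ExtractionResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const strings = ["text", "library", "version", "filename"];
  const numbers = ["charCount", "durationMs"];
  return (
    strings.every((k) => typeof v[k] === "string") &&
    numbers.every((k) => typeof v[k] === "number") &&
    (v.lineBreaks === undefined || typeof v.lineBreaks === "string") &&
    (v.hash === undefined || typeof v.hash === "string")
  );
}
```

A result that fails the guard could be treated as a failed extraction rather than written to the record.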
This has been extended into a micro-service that accepts the file type and the base64-encoded file directly, along with fileName and returnHash. As this is an open endpoint, a state or session parameter will be added to reduce fake requests.
curl https://server/extraction/api/extract/direct \
  -H "Content-Type: application/json" \
  -d "{\"b64\": \"${B64}\", \"fileType\": \"pdf\", \"fileName\": \"test4.pdf\", \"returnHash\": true}"
{
"text": "extracted text",
"charCount": 2115,
"durationMs": 241,
"library": "mammoth",
"version": "1.12.0",
"filename": "test.docx",
"hash": "c99b32954d120bf62ad818944660275f"
}
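The same call can be made from code. The sketch below builds the request body from raw file bytes and posts it to the direct endpoint. It is illustrative only: the names DirectExtractRequest, buildDirectRequest and extractDirect are hypothetical, the "docx" value for fileType is assumed by analogy with "pdf", and the use of POST with a JSON content type is an assumption.

```typescript
// Request body for the direct endpoint; field names copied from the sample request.
interface DirectExtractRequest {
  b64: string;
  fileType: "docx" | "pdf"; // "docx" is an assumed value
  fileName: string;
  returnHash: boolean;
}

// Build the JSON body from the raw bytes of the file.
function buildDirectRequest(
  bytes: Uint8Array,
  fileType: "docx" | "pdf",
  fileName: string,
  returnHash = true,
): DirectExtractRequest {
  return { b64: Buffer.from(bytes).toString("base64"), fileType, fileName, returnHash };
}

// POST the body to the service. `baseUrl` is a placeholder such as
// "https://server/extraction"; the global fetch requires Node 18 or later.
async function extractDirect(baseUrl: string, body: DirectExtractRequest): Promise<unknown> {
  const res = await fetch(`${baseUrl}/api/extract/direct`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Extraction failed with HTTP ${res.status}`);
  return res.json();
}
```

Keeping the body builder separate from the network call makes the base64 step easy to check without a running service.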
Response time is under 300 ms for a 4-page test .docx file. This covers:
- TLS handshake
- nginx proxy overhead and routing
- base64 decode
- mammoth parsing the DOCX XML
- JSON serialisation of the response
- network latency both ways
A playbook is written and fully tested.
- Upload → stores b64 of doc on StudentTask record
- (may repeat — each upload overwrites previous b64)
- Commit → OData PATCH (answers + timestamp + locked)
- → fire-and-forget POST to extraction service
- POST /api/extract
- Node Extraction Service (Express)
- PATCH StudentTask → extractionStatus: 'processing'
- Fetch b64 from StudentTask via OData
- Decode → Buffer
- Branch: DOCX → mammoth | PDF → pdfjs-dist
- Compute MD5 hash of original binary
- PATCH StudentTask → extracted text + metadata
- POST to FileMaker script → archive b64 to Documents
- FileMaker
- StudentTask record: extracted text, status fields
- Documents record: reconstituted original binary linked to StudentTask
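The decode, branch and hash steps of the flow above can be sketched in TypeScript as follows. This is not the service's actual code: the function names are hypothetical, the "docx" and "pdf" type values are assumed, and the extractors are passed in as parameters so the sketch does not depend on mammoth or pdfjs-dist being installed. The OData and FileMaker script calls are left out.

```typescript
import { createHash } from "node:crypto";

type FileType = "docx" | "pdf";
// An extractor turns the decoded binary into plain text.
type Extractor = (data: Buffer) => Promise<string>;

// Decode the stored base64 and compute the MD5 hash of the original binary.
function decodeAndHash(b64: string): { buffer: Buffer; hash: string } {
  const buffer = Buffer.from(b64, "base64");
  const hash = createHash("md5").update(buffer).digest("hex");
  return { buffer, hash };
}

// Branch on file type: DOCX goes to mammoth, PDF to pdfjs-dist.
function libraryFor(fileType: FileType): "mammoth" | "pdfjs-dist" {
  return fileType === "docx" ? "mammoth" : "pdfjs-dist";
}

// Run decode, extraction and hashing, and return the text plus metadata.
async function runExtraction(
  b64: string,
  fileType: FileType,
  extractors: Record<FileType, Extractor>,
) {
  const started = Date.now();
  const { buffer, hash } = decodeAndHash(b64);
  const text = await extractors[fileType](buffer);
  return {
    text,
    charCount: text.length,
    durationMs: Date.now() - started,
    library: libraryFor(fileType),
    hash,
  };
}
```

With mammoth installed, the DOCX extractor could be as small as `async (data) => (await mammoth.extractRawText({ buffer: data })).value`; the PDF extractor would walk the pages of the document loaded by pdfjs-dist and join their text items.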
If questions are prefixed with a known character (§), the text can be extracted with single carriage returns, and extra blank lines can then be inserted before that character for presentation purposes. The extraction is written as a service, so it could be called from other places in the FileMaker ecosystem.
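A minimal sketch of that presentation step, assuming the § marker appears at the start of a line. The function name spaceQuestions is hypothetical.

```typescript
// Insert one blank line before every line that starts with the marker,
// except when it is the first line or a blank line is already there.
// Accepts \r, \n or \r\n line endings and returns \n-separated text.
function spaceQuestions(text: string, marker = "§"): string {
  const lines = text.split(/\r\n|\r|\n/);
  const out: string[] = [];
  for (const line of lines) {
    if (line.startsWith(marker) && out.length > 0 && out[out.length - 1] !== "") {
      out.push("");
    }
    out.push(line);
  }
  return out.join("\n");
}
```

If FileMaker expects carriage returns, the final join could use "\r" instead of "\n".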