PDF Invoice Data Extraction to MongoDB – Node.js & React with OpenAI


Invoice Data Extraction – PDF to Excel, JSON and DB

A full-stack application for extracting structured data from invoices using OpenAI’s GPT models. It is built with a Node.js backend and a React frontend, and features multi-language support and automated database storage.

DISCLAIMER: This item uses third-party AI services (such as OpenAI) that are not included in the purchase price. Buyers are responsible for providing their own API keys and covering any usage costs charged by these services. No AI credits, subscriptions, or usage fees are included with this item.

Features

  • AI-Powered Extraction: Uses OpenAI GPT-4o-mini for accurate invoice data extraction with structured JSON output
  • Multi-Format Support: Processes text, PDF (using pdf-parse), and image files (using Tesseract.js OCR)
  • Database Automation: Automatically stores extracted data in MongoDB with Mongoose ODM
  • Multi-Language Support: Built-in internationalization with English and Spanish translations
  • RESTful API: Clean API endpoints for invoice management with proper error handling
  • Data Validation: Comprehensive validation service with fallbacks and data cleaning
  • File Upload: Multi-file upload support (up to 5 files) with drag-and-drop interface
  • Export Functionality: Export extracted data to Excel format
  • Unit Testing: Comprehensive test coverage with Jest and Supertest
  • Modern UI: React-based frontend with responsive design and Tailwind CSS
  • Text Preprocessing: Intelligent text preprocessing to handle OCR artifacts and formatting issues

Tech Stack

Backend

  • Node.js with Express.js
  • MongoDB with Mongoose ODM
  • OpenAI API for data extraction
  • JWT for authentication (optional)
  • Jest & Supertest for testing

Frontend

  • React with modern hooks
  • Axios for API communication
  • i18next for internationalization
  • React Router for navigation
  • Testing Library for component testing
  • Tailwind CSS for styling
  • XLSX for Excel export functionality

Architecture Overview

Backend Architecture

Data Flow

  1. File Upload: User uploads invoice files (PDF, image, text)
  2. Text Extraction: Files are processed using pdf-parse or Tesseract.js OCR
  3. AI Processing: Extracted text is sent to OpenAI with structured prompts
  4. Data Validation: AI response is validated and cleaned
  5. Database Storage: Structured data is saved to MongoDB
  6. Frontend Display: Data is displayed in a responsive table with export options
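
Steps 2 to 5 above can be read as one asynchronous pipeline. The sketch below is an illustration only, not the project's actual code: the function name `processInvoice` and the injected helpers (`extractText`, `callModel`, `validate`, `save`) are hypothetical stand-ins for the pdf-parse/Tesseract.js, OpenAI, validation, and Mongoose layers.

```javascript
// Hypothetical orchestration of steps 2-5. Every helper is injected, so the
// flow can be exercised without real OCR, OpenAI, or MongoDB calls.
async function processInvoice(file, deps) {
  const { extractText, callModel, validate, save } = deps;

  // Step 2: turn the uploaded file into plain text.
  const text = await extractText(file);
  if (!text || !text.trim()) {
    throw new Error(`No text could be extracted from ${file.name}`);
  }

  // Step 3: ask the model for structured JSON.
  const raw = await callModel(text);

  // Step 4: validate and clean the model output.
  const invoice = validate(raw);

  // Step 5: persist the record and return what was stored.
  return save(invoice);
}
```

Failing early on empty text avoids paying for an OpenAI request that cannot produce useful output.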

Project Structure

invoice-extraction/
├── backend/                 # Node.js backend
│   ├── models/              # Mongoose models
│   ├── routes/              # API routes
│   ├── services/            # Business logic services
│   ├── __tests__/           # Unit tests
│   ├── db.js                # Database connection
│   └── index.js             # Server entry point
├── frontend/                # React frontend
│   ├── src/
│   │   ├── components/      # React components
│   │   ├── i18n/            # Internationalization setup
│   │   └── __tests__/       # Component tests
├── prompts/                 # OpenAI prompt templates
├── translations/            # Language files
├── docs/                    # Documentation
├── .env                     # Environment variables
└── README.md                # This file

Prerequisites

  • Node.js (v16 or higher)
  • MongoDB (local or cloud instance)
  • OpenAI API key

Installation

  1. Clone the repository

    git clone <repository-url>
    cd invoice-extraction
  2. Install backend dependencies

    cd backend
    npm install
  3. Install frontend dependencies

    cd ../frontend
    npm install
    cd ..
  4. Environment Setup

    • Copy .env file and update the values:
      cp .env .env.local 
    • Update the following variables:
      • OPENAI_API_KEY: Your OpenAI API key
      • MONGO_URI: MongoDB connection string
      • PORT: Server port (default: 5000)
  5. Start MongoDB

    Make sure MongoDB is running on your system, or update MONGO_URI to point to a cloud instance.

Usage

Development

  1. Start the backend server

    cd backend
    npm run dev
  2. Start the frontend

    cd frontend
    npm start
  3. Access the application

    • Frontend: http://localhost:3000
    • Backend API: http://localhost:5000

Production

  1. Build the frontend

    cd frontend
    npm run build
  2. Start the backend

    cd backend
    npm start

API Documentation

Invoice Endpoints

Upload Invoice

POST /api/invoices/upload
Content-Type: multipart/form-data

Form Data:
- invoice: File (text, PDF, or image)

Response:

{
  "message": "Invoice processed successfully",
  "invoice": {
    "vendor": "Vendor Name",
    "invoiceNumber": "INV-001",
    "date": "2023-01-01T00:00:00.000Z",
    "totalAmount": 100.50,
    "currency": "USD",
    "items": [...],
    "status": "processed"
  }
}

Get All Invoices

GET /api/invoices

Get Invoice by ID

GET /api/invoices/:id

Delete Invoice

DELETE /api/invoices/:id

AI Processing & Data Extraction

OpenAI Integration

The system uses OpenAI’s GPT-4o-mini model with structured JSON output to ensure consistent data extraction. The AI is prompted with:

  • System Prompt: Defines the AI’s role as an invoice data extraction expert
  • User Prompt: Provides the extracted text and specifies the exact JSON format required
  • JSON Schema: Enforces structured output with validation rules
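
As a rough illustration of how these three parts fit together, the sketch below builds a request body for OpenAI's Chat Completions API with `response_format` set to a JSON schema. The schema covers only a small subset of fields. The prompt wording, the `temperature` value, and the names `invoiceSchema` and `buildExtractionRequest` are assumptions made for this example, not the project's actual prompts.

```javascript
// Assumed schema: a small subset of the invoice fields, for illustration only.
// Strict mode expects every property to be required and no extra properties.
const invoiceSchema = {
  type: "object",
  properties: {
    vendor: { type: "string" },
    invoiceNumber: { type: "string" },
    date: { type: "string", description: "Invoice date in ISO 8601 format" },
    totalAmount: { type: "number" },
    currency: { type: "string", description: "ISO 4217 code, e.g. USD" },
  },
  required: ["vendor", "invoiceNumber", "date", "totalAmount", "currency"],
  additionalProperties: false,
};

// Builds the request body only; no network call is made here.
function buildExtractionRequest(invoiceText) {
  return {
    model: "gpt-4o-mini",
    temperature: 0, // assumption: low randomness for repeatable extraction
    messages: [
      // System prompt: defines the model's role (hypothetical wording).
      { role: "system", content: "You are an invoice data extraction expert." },
      // User prompt: carries the extracted text (hypothetical wording).
      {
        role: "user",
        content: `Extract the invoice fields from the text below.\n\n${invoiceText}`,
      },
    ],
    // JSON schema: constrains the shape of the reply.
    response_format: {
      type: "json_schema",
      json_schema: { name: "invoice", strict: true, schema: invoiceSchema },
    },
  };
}
```

With the official `openai` Node.js package, an object like this could be passed to `client.chat.completions.create(...)`. The structured reply arrives as a JSON string in `choices[0].message.content` and still needs `JSON.parse` before validation.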

Text Extraction Process

  1. PDF Files: Processed using pdf-parse library to extract text content
  2. Image Files: OCR processing using Tesseract.js with optimized parameters
  3. Text Files: Direct UTF-8 text extraction
  4. Preprocessing: Text cleaning to handle OCR artifacts and formatting issues
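
The routing and cleanup steps above might look roughly like the following. This is a sketch under stated assumptions: `pickExtractor` and `preprocessText` are hypothetical names, the MIME-type checks are one possible way to choose an extraction path, and the cleanup rules are examples of typical OCR fixes rather than the project's actual rules.

```javascript
// Hypothetical routing of an upload to one of the three extraction paths.
function pickExtractor(mimeType) {
  if (mimeType === "application/pdf") return "pdf-parse";
  if (mimeType.startsWith("image/")) return "tesseract";
  if (mimeType.startsWith("text/")) return "utf8";
  throw new Error(`Unsupported file type: ${mimeType}`);
}

// Assumed cleanup rules for the preprocessing step.
function preprocessText(raw) {
  return raw
    .replace(/\r\n?/g, "\n")                   // normalize line endings
    .replace(/[\u2018\u2019]/g, "'")           // curly single quotes -> straight
    .replace(/[\u201C\u201D]/g, '"')           // curly double quotes -> straight
    .replace(/([A-Za-z])-\n([a-z])/g, "$1$2")  // rejoin words hyphenated across lines
    .replace(/[ \t]+/g, " ")                   // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, "\n")                  // drop spaces around line breaks
    .replace(/\n{3,}/g, "\n\n")                // keep at most one blank line in a row
    .trim();
}
```

The hyphen rule only joins a letter to a following lowercase letter, so an invoice number that wraps after its hyphen (for example `INV-` followed by digits on the next line) is left untouched.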

Data Validation & Cleaning

  • Schema Validation: Ensures all required fields are present and properly formatted
  • Fallback Values: Provides sensible defaults for missing data
  • Type Conversion: Validates dates, amounts, and other data types
  • Duplicate Prevention: Uses invoice number as unique identifier for upsert operations
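
A minimal sketch of these four ideas follows. The function names `normalizeInvoice` and `buildUpsert`, the specific fallback values, and the choice to reject a record with no invoice number are assumptions for illustration. The field names follow the sample API response.

```javascript
// Cleans a raw model response into a predictable shape (assumed rules).
function normalizeInvoice(raw) {
  // Type conversion: accept numbers or strings such as "$1,234.50".
  const toNumber = (value, fallback = 0) => {
    const n = typeof value === "number"
      ? value
      : parseFloat(String(value ?? "").replace(/[^0-9.-]/g, ""));
    return Number.isFinite(n) ? n : fallback;
  };

  // Type conversion: unparseable or missing dates become null.
  const toDate = (value) => {
    if (value == null || value === "") return null;
    const d = new Date(value);
    return Number.isNaN(d.getTime()) ? null : d;
  };

  // Schema validation: the upsert key must be present.
  const invoiceNumber = String(raw.invoiceNumber ?? "").trim();
  if (!invoiceNumber) throw new Error("invoiceNumber is required");

  return {
    vendor: String(raw.vendor ?? "").trim() || "Unknown Vendor", // fallback value
    invoiceNumber,
    date: toDate(raw.date),
    totalAmount: toNumber(raw.totalAmount),
    currency: String(raw.currency ?? "").trim().toUpperCase() || "USD", // fallback value
    items: Array.isArray(raw.items) ? raw.items : [],
  };
}

// Duplicate prevention: arguments for an upsert keyed on the invoice number.
function buildUpsert(invoice) {
  return {
    filter: { invoiceNumber: invoice.invoiceNumber },
    update: { $set: invoice },
    options: { upsert: true, new: true, runValidators: true },
  };
}
```

With Mongoose, the three values returned by `buildUpsert` map onto `Model.findOneAndUpdate(filter, update, options)`, so processing the same invoice number twice updates the existing document instead of creating a second one.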

Supported Invoice Fields

  • Vendor/Supplier information
  • Invoice number and dates
  • Financial amounts (total, subtotal, tax, discounts, shipping)
  • Customer and shipping details
  • Line items with descriptions, quantities, and pricing
  • Payment terms and currency information

Testing

Backend Tests

cd backend
npm test

Frontend Tests

cd frontend
npm test

Multi-Language Support

The application supports multiple languages through JSON-based translations.

Adding a New Language

  1. Create a new translation file in translations/ directory
  2. Update the i18n configuration in frontend/src/i18n/index.js
  3. Add language option in the UI
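
As a sketch of what these three steps could amount to, the example below adds a hypothetical French translation next to the existing two. The translation keys, the inline objects (which stand in for the parsed JSON files in translations/), and the `languageOptions` helper are assumptions. The option names (`resources`, `lng`, `fallbackLng`, `supportedLngs`, `interpolation.escapeValue`) are standard i18next init options.

```javascript
// Hypothetical translation keys; in the project these objects would be the
// parsed contents of the JSON files in translations/.
const en = { uploadTitle: "Upload invoices", exportExcel: "Export to Excel" };
const es = { uploadTitle: "Subir facturas", exportExcel: "Exportar a Excel" };
const fr = { uploadTitle: "Importer des factures", exportExcel: "Exporter vers Excel" }; // step 1: new language

// Step 2: options object in the shape accepted by i18next's init().
const i18nOptions = {
  resources: {
    en: { translation: en },
    es: { translation: es },
    fr: { translation: fr },
  },
  lng: "en",
  fallbackLng: "en", // missing keys fall back to English
  supportedLngs: ["en", "es", "fr"],
  interpolation: { escapeValue: false }, // React already escapes rendered values
};

// Step 3: a language switcher in the UI can be driven from the same list.
const languageOptions = i18nOptions.supportedLngs.map((code) => ({
  code,
  label: code.toUpperCase(),
}));
```

Keeping the same key set in every language file makes missing translations easy to spot, since any gap falls back to the `fallbackLng` text.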

Current Languages

  • English (en)
  • Spanish (es)

Configuration

All configuration is managed through environment variables in the .env file:

  • PORT: Server port
  • MONGO_URI: MongoDB connection string
  • OPENAI_API_KEY: OpenAI API key
  • JWT_SECRET: JWT secret for authentication
  • MAX_FILE_SIZE: Maximum file upload size
  • DEFAULT_LANGUAGE: Default application language
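
One way to read these variables in the backend is a small loader that applies defaults and fails fast on missing values. This is a sketch, not the project's code: `loadConfig` is a hypothetical name, and treating MONGO_URI and OPENAI_API_KEY as required is an assumption. The defaults mirror the values shown elsewhere in this document (port 5000, 10485760 bytes, `en`).

```javascript
// Reads configuration from an environment-like object (defaults to process.env).
function loadConfig(env = process.env) {
  // Assumption: the app cannot run without a database and an API key.
  const required = ["MONGO_URI", "OPENAI_API_KEY"];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }

  return {
    port: Number.parseInt(env.PORT || "5000", 10),
    mongoUri: env.MONGO_URI,
    openaiApiKey: env.OPENAI_API_KEY,
    jwtSecret: env.JWT_SECRET, // optional: JWT authentication is optional
    maxFileSize: Number.parseInt(env.MAX_FILE_SIZE || "10485760", 10), // bytes (10 MiB)
    defaultLanguage: env.DEFAULT_LANGUAGE || "en",
  };
}
```

Calling `loadConfig()` once at startup surfaces a missing key immediately, rather than on the first upload.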

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

Development Guidelines

Code Style

  • Backend: Follow Node.js best practices with async/await patterns
  • Frontend: Use React functional components with hooks
  • Naming: Use camelCase for variables/functions, PascalCase for components
  • Error Handling: Implement proper try-catch blocks and error responses
  • Comments: Add JSDoc comments for functions and complex logic

Environment Variables

Create a .env file in the root directory with:

# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here

# Database Configuration
MONGO_URI=mongodb://localhost:27017/invoice-extraction

# Server Configuration
PORT=5000
JWT_SECRET=your_jwt_secret_here

# File Upload Configuration
MAX_FILE_SIZE=10485760
DEFAULT_LANGUAGE=en

Running Tests

# Backend tests
cd backend && npm test

# Frontend tests
cd frontend && npm test

Adding New Features

  1. Create a feature branch from main
  2. Implement the feature with proper error handling
  3. Add unit tests for new functionality
  4. Update documentation if needed
  5. Submit a pull request

Deployment

Backend Deployment

  1. Environment Setup: Configure production environment variables
  2. Database: Set up MongoDB instance (local or cloud)
  3. Build: Run npm install --production for dependencies
  4. Start: Use npm start or process manager like PM2

Frontend Deployment

  1. Build: Run npm run build in frontend directory
  2. Serve: Deploy built files to web server (nginx, Apache, etc.)
  3. API Configuration: Update API endpoints for production

Docker Deployment (Optional)

# Example Dockerfile for backend
FROM node:16-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
EXPOSE 5000
CMD ["npm", "start"]

Production Considerations

  • Security: Use HTTPS, validate inputs, and apply rate limiting
  • Monitoring: Implement logging and error tracking
  • Scalability: Consider load balancing for high traffic
  • Backup: Regular database backups
  • Updates: Keep dependencies updated and monitor for vulnerabilities

Troubleshooting

Common Issues

  • OpenAI API Errors: Check API key and quota limits
  • MongoDB Connection: Verify connection string and network access
  • File Upload Issues: Check file size limits and supported formats
  • OCR Problems: Ensure Tesseract.js is properly installed

Debug Mode

Set NODE_ENV=development for detailed error logging and debugging information.
