Usage

mpiQulacs has the same API interface as Qulacs with some exceptions.
You can write quantum programs just like Qulacs.
See also  Qulacs documents for descriptions and code examples for each API.

Example

from qulacs import QuantumCircuit, QuantumState
from qulacs.circuit import QuantumCircuitOptimizer
from qulacs.gate import Y,CNOT,merge
from mpi4py import MPI  # import is required for MPI execution even if MPI classes are not used

comm = MPI.COMM_WORLD
mpirank = comm.Get_rank()

# Define quantum states and initialize them with random values
state = QuantumState(3, use_multi_cpu=True)
state.set_Haar_random_state()

# Define quantum circuits and update quantum states
circuit = QuantumCircuit(3)
circuit.add_X_gate(0)
merged_gate = merge(CNOT(0, 1), Y(1))
circuit.add_gate(merged_gate)
circuit.add_RX_gate(1, 0.5)

# quantum circuit optimization
Opt_type = "Heavy"
if Opt_type == "Light":
  # Optimize using SWAP (reduces inter-node communication in update processing)
  swap_level = 1
  # Quantum gates of a quantum circuit is merged by a greedy method
  # until the target quantum bit becomes the specified size.
  if state.get_device_name() == "multi-cpu":
    QuantumCircuitOptimizer().optimize_light(circuit, swap_level)
  else:
    QuantumCircuitOptimizer().optimize_light(circuit)
elif Opt_type == "Heavy":
  # Maximum quantum gate size allowed to be merged
  max_block_size = 2
  # More powerful SWAP/FusedSWAP optimizations than swap_level = 1
  swap_level = 2
  # The optimisation also takes into account the exchange of exchangeable quantum gates.
  # If the quantum circuit size is large, the optimisation process time may increase significantly.
  if state.get_device_name() == "multi-cpu":
    QuantumCircuitOptimizer().optimize(circuit, max_block_size, swap_level)
  else:
    QuantumCircuitOptimizer().optimize(circuit, max_block_size)

# Update states using circuit
circuit.update_quantum_state(state)

# Sampling from quantum states
samples = state.sampling(20, 2022)  # with random seed = 2022
if mpirank == 0:
  print(samples)

The second argument to QuantumState must be use_multi_cpu=True . (This distributes state vectors across multiple compute nodes, enabling parallel processing by MPI.)

List of APIs with functional differences from Qulacs

QuantumState class

QuantumState(qubits, use_multi_cpu)
- use_multi_cpu = False
  - Positions the entire state vector within one node (same behavior as Qulacs)
- use_multi_cpu = True
  - Distributes the state vector across multiple nodes.
    
    However, (N - k) ≦ log₂(S) does not distribute. (S is the number of MPI ranks, N is the number of qubits, and k is the minimum number of qubits per process (constant k = 2).)
  - The qubit is classified as local qubit and global qubit and is arranged as follows:
    - local qubit: Placed in same node
    - global qubit: Distributed across multiple nodes
      - The top log₂(number of ranks) qubits are classified as global qubits

state.get_device_name()

Returns the device where the state vector is located.

Return value	Explanation
“cpu”	State vector created in 1 node
“multi-cpu”	State vector distributed across nodes (ranks)
(“gpu”)	Qulacs means that the state vector is placed on the GPU, but mpiQulacs does not support placement on the GPU.

state.load(vector)
- When state vectors are distributed, only the vector elements of the entire state vector that are placed at each rank must be specified as arguments. (For calls to state.load with rank N, only vector elements that are placed in rank N should be specified as arguments to the load method.) If you want to load the entire state vector as an argument, see Method for loading/getting entire distributed state vector.
state.get_vector()
- If the state vectors are distributed, returns only the state vector of each rank (that is, a portion of the entire state vector). If you want to get the entire state vector as a return value, see Method for loading/getting entire distributed state vector .

state.to_string()

For a state vector with a large number of elements, the first 256 elements are output.
If the state vectors are distributed, the first part of each rank’s state vector (part of the whole state vector) is output.

The number of local qubit and global qubit is also output.

Sample output

-- rank 0 --------------------------------------
*** Quantum State ***
* MPI rank / size : 0 / 2
* Qubit Count : 20 (local / global : 19 / 1 )
* Dimension   : 262144
* state vector is too long, so the (128 x 2) elements are output.
* State vector (rank 0):
(1,0)
...
-- rank 1 --------------------------------------
* State vector (rank 1):
(0,0)
...

state.set_Haar_random_state([seed])
- Initializes each element of the state vector with a random number.
- If the state vector is distributed,
  - If you do not specify seed, the random seed value generated by rank 0 is first shared among all ranks.
  - Based on the specified or shared seed, each rank uses (seed + rank number) as the final seed value. Therefore, if the state vectors are distributed, the state vectors generated will be different if the number of divisions is different, even if the same seed value is specified for the argument.
state.sample(number_sampling [, seed])
- If seed is not specified, a random seed value generated at rank 0 is shared among all ranks and used as the seed value.

FusedSWAP(qubit_idx1, qubit_idx2, block_size)
- Swaps the placement of block_size qubits from the index specified in qubit_idx1 with the placement of block_size qubits from the index specified in qubit_idx2.
  - The effect of the gate is equivalent to continuously applying a SWAP gate to the range of qubit index being swapped.
    
    When a FusedSWAP gate is used, its replacement process is completed with one gate.
- The primary use of FusedSWAP is to speed up gate operations when state vectors are distributed.
- As the qubit count increases, the information in the top log₂(number of ranks) qubits is distributed across the ranks. Quantum gate operations on qubits (global qubits) that are placed across ranks are slower than operations on qubits (local qubits) that are placed in the same rank because they require communication between compute nodes.
  
  If you have a lot of gate operations on the global qubit, you can reduce the gate operations on qubits across the ranks by using a FusedSWAP gate to swap the placement of the global qubit and the local qubit. This reduces the amount of communication during gate operation and speeds up the operation.
- Please also refer to Example of use of FusedSWAP .
Specifying seed in update_quantum_state
- QuantumGate.update_quantum_state(state, seed)
- QuantumCircuit.update_quantum_state(state, seed)
  
  Random numbers used in gates using random numbers such as Measurement can be fixed. Once seed is specified, the random number generator initialized by seed is inherited.
Automatic FusedSWAP gate insertion of QuantumCircuitOptimizer
- Add the swap_level argument to the optimize method of QuantumCircuitOptimizer. By specifying swap_level = 1, SWAP/FusedSWAP gates are automatically inserted to reduce gate operations on the global qubit. This reduces the amount of communication during gate operation and increases the speed.
- optimize(circuit, block_size, swap_level=0)
- optimize_light(circuit, swap_level=0)
  - swap_level = 0
    - Do not insert SWAP/FusedSWAP gate.
  - swap_level = 1
    - Inserting SWAP/FusedSWAP gate reduces and speeds up inter-node communication in update operations.
  - swap_level = 2
    - Insert SWAP/FusedSWAP gates while changing the gate order. You might be able to reduce traffic even further than swap_level = 1.
    In mpiQulacs v1.3.0 and later, it is possible to perform optimization on a gate-by-gate basis and optimization using FusedSWAP at the same time.

Note

Basically, every method has to be run the same way in every rank.
(Do not perform a gate operation or call the sampling method in a program block that behaves differently depending on its rank, such as if rank == 0 .)
If you specify seed, you must specify a value that is common to all ranks.
Please also refer to item Notes on programming in System Usage.

Note

The FusedSWAP gate and QuantumCircuitOptimizer’s automatic FusedSWAP gate insertion (swap_level > 1) have different effects depending on the structure of the quantum circuit and the number of MPI parallelism.

We recommend that you compare and consider the performance with and without them.

Environment variable settings

QULACS_NUM_THREADS (1 - 1024)
- Specifies the maximum number of threads to use with mpiQulacs. Takes precedence over OMP_NUM_THREADS.

Example of use of FusedSWAP

In some cases, you can use FusedSWAP to speed up the simulation by reducing gate operations on qubits across ranks.

The following code inserts FusedSWAP before performing any operations on the global qubit to prevent gate operations on qubits across ranks.

import numpy as np
from qulacs import QuantumCircuit, QuantumState
from mpi4py import MPI

comm = MPI.COMM_WORLD
mpirank = comm.Get_rank()
mpisize = comm.Get_size()
np.random.seed(seed=32)

def build_circuit(depth, nqubits, mpisize):
    global_qc = int(np.log2(mpisize))
    local_qc = nqubits - global_qc
    swapped = False

    circuit = QuantumCircuit(nqubits)
    for _ in range(depth):
        # operation on local qubit
        for i in range(local_qc):
            circuit.add_H_gate(i)

        # Swap the placement of global qubit and local qubit before operating on global qubit
        # (Place global qubits in the same rank)
        if global_qc != 0:
            circuit.add_FusedSWAP_gate(local_qc - global_qc, local_qc, global_qc)
            swapped = not swapped

        # operation on global qubit
        for i in range(global_qc):
            circuit.add_H_gate(local_qc - global_qc + i)

    # In the for statement above, we added FusedSWAP depth times,
    # we added FusedSWAP as needed to return to the original qubit sequence.
    if (global_qc != 0) and swapped:
        circuit.add_FusedSWAP_gate(local_qc - global_qc, local_qc, global_qc)

    return circuit

if __name__ == '__main__':
    nqubits = 36
    st = QuantumState(nqubits, use_multi_cpu=True)
    circuit = build_circuit(10, nqubits, mpisize)
    circuit.update_quantum_state(st)

Method for loading/getting entire distributed state vector

As described in List of APIs with functional differences from Qulacs , QuantumState.load(vector), QuantumState.get_vector() processes only those vector elements that are placed at each rank in the entire state vector.

If you want to load/get_vector the entire state vector as an argument/return value, define a method like this and use it instead of load, get_vector:.

However, note the following:.
As the qubit number increases, the entire state vector becomes too large to fit in one node of memory.
For this reason, the state vector is split into multiple nodes, and the load/get_vector method only processes the vector elements that each node is responsible for.
Note that if you define a method such as the one below and try to process the entire state vector within one node, a large number of qubits (close to 30 qubits) can cause a program crash due to lack of memory.

import numpy as np
from mpi4py import MPI

def load_state_vector(state, vector):
    """
    Loads the given entire state vector into the given state.

    Args:
        state (qulacs.QuantumState): a quantum state
        vector: a state vector to load
    """
    if state.get_device_name() == 'multi-cpu':
        mpicomm = MPI.COMM_WORLD
        mpirank = mpicomm.Get_rank()
        mpisize = mpicomm.Get_size()
        vector_len = len(vector)
        idx_start = vector_len // mpisize * mpirank
        idx_end = vector_len // mpisize * (mpirank + 1)
        state.load(vector[idx_start:idx_end])
    else:
        state.load(vector)

def get_state_vector(state):
    """
    Gets the entire state vector from the given state.

    Args:
        state (qulacs.QuantumState): a quantum state
    Return:
        vector: a state vector
    """
    if state.get_device_name() == 'multi-cpu':
        mpicomm = MPI.COMM_WORLD
        mpisize = mpicomm.Get_size()
        vec_part = state.get_vector()
        len_part = len(vec_part)
        vector_len = len_part * mpisize
        vector = np.zeros(vector_len, dtype=np.complex128)
        mpicomm.Allgather([vec_part, MPI.DOUBLE_COMPLEX],
                          [vector, MPI.DOUBLE_COMPLEX])
        return vector
    else:
        return state.get_vector()